Abstract

We study the problem of online learning with a notion of regret defined with respect to a 
set of strategies. We develop tools for analyzing the minimax rates and for deriving regret- 
minimization algorithms in this scenario. While the standard methods for minimizing the usual 
notion of regret fail, through our analysis we demonstrate existence of regret-minimization 
methods that compete with such sets of strategies as: autoregressive algorithms, strategies based 
on statistical models, regularized least squares, and follow the regularized leader strategies. In 
several cases we also derive efficient learning algorithms. 



1 Introduction 

The common criterion for evaluating an online learning algorithm is regret, that is the difference 
between the cumulative loss of the algorithm and the cumulative loss of the best fixed decision, 
chosen in hindsight. While much work has been done on understanding no-regret algorithms, such 
a definition of regret against a fixed decision often draws criticism: even if regret is small, the 
cumulative loss of a best fixed action can be large, thus rendering the result uninteresting. To 
address this problem, various generalizations of the regret notion have been proposed, including 
regret with respect to the cost of a "slowly changing" compound decision. While being a step in the 
right direction, such definitions are still "static" in the sense that the decision of each compound 
comparator per step does not depend on the sequence of realized outcomes. 

Arguably, a more interesting (and more difficult to deal with) notion is that of performing as well as 
a set of strategies (or, algorithms). A strategy $\pi$ is a sequence of functions $\pi_t$, for each time period $t$,
mapping the observed outcomes to the next action. Of course, if the collection of such strategies is 
finite, we may disregard their dependence on the actual sequence and treat each strategy as a black 
box expert. This is precisely the reason the Multiplicative Weights and other expert algorithms 
gained such popularity. However, this "black box" approach is not always desirable since some 
measure of the "effective number" of experts must play a role in the complexity of the problem: 
experts that predict similarly should not count as two independent ones. But what is a notion of 
closeness of two strategies? Imagine that we would like to develop an algorithm that incurs loss 
comparable to that of the best of an infinite family of strategies. To obtain such a statement, one 
may try to discretize the space of strategies and invoke the black-box experts method. As we show 
in this paper, such an approach will not always work. Instead, we present a theoretical framework
for the analysis of "competing against strategies" and for algorithmic development, based on the
ideas in [11, 9]. 

The strategies considered in this paper are termed "simulatable experts" in [3]. The authors also 
distinguish static and non-static experts. In particular, for static experts and absolute loss, [2] were 
able to show that problem complexity is governed by the geometry of the class of static experts as 
captured by its i.i.d. Rademacher averages. For nonstatic experts, however, the authors note that 
"unfortunately we do not have a characterization of the minimax regret by an empirical process", 
due to the fact that the sequential nature of the online problems is at odds with the i.i.d.-based
notions of classical empirical process theory. In recent years, however, a martingale generalization 
of empirical process theory has emerged, and these tools were shown to characterize learnability 
of online supervised learning, online convex optimization, and other scenarios [11, 1]. Yet, the 
machinery developed so far is not directly applicable to the case of general simulatable experts 
which can be viewed as mappings from an ever-growing set of histories to the space of actions. The 
goal of this paper is precisely this: to extend the non-constructive as well as constructive techniques 
of [11, 9] to simulatable experts. We analyze a number of examples with the developed techniques, 
but we must admit that our work only scratches the surface. We can imagine further research 
developing methods that compete with interesting gradient descent methods (parametrized by step 
size choices), with Bayesian procedures (parametrized by choices of priors), and so on. We also note 
the connection to online algorithms, where one typically aims to prove a bound on the competitive 
ratio. Our results can be seen in that light as implying a competitive ratio of one. 

We close the introduction with a high-level outlook, which builds on the ideas of [8]. Imagine 
we are faced with a sequence of data from a probabilistic source, such as a $k$-Markov model with
unknown transition probabilities. A well developed statistical theory tells us how to estimate the 
parameter under the assumption that the model is correct. We may view an estimator as a strategy 
for predicting the next outcome. Suppose we have a set of possible models, with a good prediction 
strategy for each model. Now, let us lift the assumption that the sequence is generated by one of 
these models, and set the goal as that of performing as well as the best prediction strategy. In this 
case, if the observed sequence is indeed given by one of the models, our loss will be small because 
one of the strategies will perform well. If not, we still have a valid statement that does not rely 
on the fact that the model is "well specified". To illustrate the point, we will exhibit an example 
where we can compete with the set of all Bayesian strategies (parametrized by priors). We then 
obtain a statement that we perform as well as the best of them without assuming that the model 
is correct. 

The paper is organized as follows. In Section 2, we extend the minimax analysis of online learning 
problems to the case of competing with a set of strategies. In Section 3, we show that it is possible 
to compete with a set of autoregressive strategies, and that the usual online linear optimization 
algorithms do not attain the optimal bounds. We then derive an optimal and computationally 
efficient algorithm for one of the proposed regimes. In Section 4 we describe the general idea 
of competing with statistical models that use sufficient statistics, and demonstrate an example 
of competing with a set of strategies parametrized by priors. For this example, we derive an 
optimal and efficient randomized algorithm. In Section 5, we turn to the question of competing 
with regularized least squares algorithms indexed by the choice of a shift and a regularization 
parameter. In Section 6, we consider online linear optimization and show that it is possible to 
compete with Follow the Regularized Leader methods parametrized by a shift and by a step size
schedule.



2 Minimax Regret and Sequential Rademacher Complexity 

We consider the problem of online learning, or sequential prediction, that consists of $T$ rounds.
At each time $t \in \{1, \ldots, T\} = [T]$, the learner makes a prediction $f_t \in \mathcal{F}$ and observes an outcome
$z_t \in \mathcal{Z}$, where $\mathcal{F}$ and $\mathcal{Z}$ are abstract sets of decisions and outcomes. Let us fix a loss function
$\ell : \mathcal{F} \times \mathcal{Z} \to \mathbb{R}$ that measures the quality of prediction. A strategy $\pi = (\pi_t)_{t=1}^T$ is a sequence of
functions $\pi_t : \mathcal{Z}^{t-1} \to \mathcal{F}$ mapping the history of outcomes to a decision. Let $\Pi$ denote a set of strategies.
The regret with respect to $\Pi$ is the difference between the cumulative loss of the player and the
cumulative loss of the best strategy

\[
\mathrm{Reg}_T = \sum_{t=1}^T \ell(f_t, z_t) - \inf_{\pi \in \Pi} \sum_{t=1}^T \ell(\pi_t(z_{1:t-1}), z_t),
\]

where we use the notation $z_{1:k} = \{z_1, \ldots, z_k\}$. We now define the value of the game against a set $\Pi$
of strategies as

\[
\mathcal{V}_T(\Pi) = \inf_{q_1 \in \mathcal{Q}} \sup_{z_1 \in \mathcal{Z}} \mathbb{E}_{f_1 \sim q_1} \cdots \inf_{q_T \in \mathcal{Q}} \sup_{z_T \in \mathcal{Z}} \mathbb{E}_{f_T \sim q_T} \left[ \mathrm{Reg}_T \right],
\]

where $\mathcal{Q}$ and $\mathcal{P}$ are the sets of probability distributions on $\mathcal{F}$ and $\mathcal{Z}$, correspondingly. It was
shown in [11] that one can derive non-constructive upper bounds on the value through a process 
of sequential symmetrization, and in [9] it was shown that these non-constructive bounds can be 
used as relaxations to derive an algorithm. This is the path we take in this paper. 
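
To make the regret against a set of strategies concrete, the following minimal sketch evaluates it for a finite set $\Pi$ on a fixed outcome sequence. The two strategies (`always_zero`, `repeat_last`) and the data are hypothetical and chosen only for illustration; the minimax value itself, which involves nested infima and suprema, is not computed here.

```python
from typing import Callable, Sequence

# A strategy maps the observed history z_1..z_{t-1} to a decision f_t.
Strategy = Callable[[Sequence[float]], float]


def cumulative_loss(strategy, outcomes, loss):
    """Sum over rounds t of loss(strategy(z_{1:t-1}), z_t)."""
    return sum(loss(strategy(outcomes[:t]), outcomes[t]) for t in range(len(outcomes)))


def regret(predictions, outcomes, strategies, loss):
    """Learner's cumulative loss minus the loss of the best strategy in hindsight."""
    learner = sum(loss(f, z) for f, z in zip(predictions, outcomes))
    best = min(cumulative_loss(s, outcomes, loss) for s in strategies)
    return learner - best


def abs_loss(f, z):
    return abs(f - z)


# Hypothetical strategies: always predict 0, or repeat the last observed outcome.
always_zero = lambda hist: 0.0
repeat_last = lambda hist: hist[-1] if hist else 0.0

outcomes = [1, 1, 0, 0, 1]
predictions = [0.5] * 5  # a learner that always hedges
reg = regret(predictions, outcomes, [always_zero, repeat_last], abs_loss)
print(reg)
```

Note that each strategy is re-evaluated on the realized history, which is exactly what distinguishes this comparator from a fixed decision.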

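Let us describe an important variant of the above problem: that of supervised learning. Here,
before making a real-valued prediction $\hat{y}_t$ on round $t$, the learner observes side information $x_t \in \mathcal{X}$.
Simultaneously, the actual outcome $y_t \in \mathcal{Y}$ is chosen by Nature. A strategy can therefore depend
on the history $x_{1:t-1}, y_{1:t-1}$ and the current $x_t$, and we write such strategies as $\pi_t(x_{1:t}, y_{1:t-1})$, with
$\pi_t : \mathcal{X}^t \times \mathcal{Y}^{t-1} \to \mathcal{Y}$. Fix some loss function $\ell(\hat{y}, y)$. The value $\mathcal{V}_T^S(\Pi)$ is then defined as

\[
\sup_{x_1} \inf_{q_1 \in \Delta(\mathcal{Y})} \sup_{y_1 \in \mathcal{Y}} \mathbb{E}_{\hat{y}_1 \sim q_1} \cdots \sup_{x_T} \inf_{q_T \in \Delta(\mathcal{Y})} \sup_{y_T \in \mathcal{Y}} \mathbb{E}_{\hat{y}_T \sim q_T} \left[ \sum_{t=1}^T \ell(\hat{y}_t, y_t) - \inf_{\pi \in \Pi} \sum_{t=1}^T \ell(\pi_t(x_{1:t}, y_{1:t-1}), y_t) \right].
\]

To proceed, we need to define a notion of a tree. A $\mathcal{Z}$-valued tree $\mathbf{z}$ is a sequence of mappings
$\{\mathbf{z}_1, \ldots, \mathbf{z}_T\}$ with $\mathbf{z}_t : \{\pm 1\}^{t-1} \to \mathcal{Z}$. Throughout the paper, $\epsilon_t \in \{\pm 1\}$ are i.i.d. Rademacher variables,
and a realization of $\epsilon = (\epsilon_1, \ldots, \epsilon_T)$ defines a path on the tree, given by $\mathbf{z}_{1:t}(\epsilon) = (\mathbf{z}_1(\epsilon), \ldots, \mathbf{z}_t(\epsilon))$ for
any $t \in [T]$. We write $\mathbf{z}_t(\epsilon)$ for $\mathbf{z}_t(\epsilon_{1:t-1})$. By convention, a sum $\sum_{i=a}^{b} = 0$ for $a > b$, and for simplicity
we assume that no loss is suffered on the first round.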

Definition 1. Sequential Rademacher complexity of the set $\Pi$ of strategies is defined as

\[
\mathfrak{R}(\ell, \Pi) = \sup_{\mathbf{w}, \mathbf{z}} \mathbb{E}_{\epsilon} \sup_{\pi \in \Pi} \left[ \sum_{t=1}^T \epsilon_t \, \ell(\pi_t(\mathbf{w}_1(\epsilon), \ldots, \mathbf{w}_{t-1}(\epsilon)), \mathbf{z}_t(\epsilon)) \right] \tag{1}
\]

where the supremum is over two $\mathcal{Z}$-valued trees $\mathbf{z}$ and $\mathbf{w}$ of depth $T$.

The $\mathbf{w}$ tree can be thought of as providing "history" while $\mathbf{z}$ provides "outcomes". We shall
use these names throughout the paper. The reader might notice that in the above definition, the 
outcomes and history are decoupled. We now state the main result: 
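
As a small illustration of the definition, the sketch below evaluates the expectation inside (1) by brute force for fixed trees and a finite set of strategies. It does not take the supremum over trees, so it only gives a lower bound on the complexity. The encoding of a tree as a function of the sign prefix $\epsilon_{1:t-1}$ and the example strategies are assumptions made for this example.

```python
from itertools import product


def seq_rademacher(strategies, w_tree, z_tree, loss, T):
    """E_eps sup_pi sum_t eps_t * loss(pi_t(w_1(eps),...,w_{t-1}(eps)), z_t(eps)).

    The trees are FIXED: each one maps a sign prefix eps_{1:t-1} (a tuple) to a value.
    All 2^T sign sequences are enumerated, so this is only feasible for small T.
    """
    total = 0.0
    for eps in product((-1, 1), repeat=T):
        # hist[j] is w_{j+1}(eps), which depends only on the first j signs.
        hist = [w_tree(eps[:t]) for t in range(T)]
        total += max(
            sum(eps[t] * loss(pi(hist[:t]), z_tree(eps[:t])) for t in range(T))
            for pi in strategies
        )
    return total / 2 ** T


# Two constant strategies and constant trees (hypothetical example).
const0 = lambda hist: 0.0
const1 = lambda hist: 1.0
w_tree = lambda prefix: 0.0
z_tree = lambda prefix: 1.0
value = seq_rademacher([const0, const1], w_tree, z_tree, lambda f, z: f * z, T=2)
print(value)
```

With a single strategy the expectation is zero, since each term is a Rademacher sign multiplied by a quantity determined by earlier signs; the complexity is nonzero only because the supremum over $\Pi$ sits inside the expectation.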

Theorem 1. The value of the prediction problem with a set $\Pi$ of strategies is upper bounded as

\[
\mathcal{V}_T(\Pi) \le 2 \mathfrak{R}(\ell, \Pi).
\]



While the statement is visually similar to those in [11, 12], it does not follow from these works.
Indeed, the proof (which appears in the Appendix) needs to deal with the additional complications
stemming from the dependence of strategies on the history. Further, we provide the proof for a
more general case when sequences $z_1, \ldots, z_T$ are not arbitrary but need to satisfy constraints.

As we show below, the sequential Rademacher complexity on the right-hand side allows us to
analyze general non-static experts, thus addressing the question raised in [2]. As the first step, we
can "erase" a Lipschitz loss function (see [10] for more details), leading to the sequential Rademacher
complexity of $\Pi$ without the loss and without the $\mathbf{z}$ tree:

\[
\mathfrak{R}(\Pi) = \sup_{\mathbf{w}} \mathfrak{R}(\Pi, \mathbf{w}) = \sup_{\mathbf{w}} \mathbb{E}_{\epsilon} \sup_{\pi \in \Pi} \left[ \sum_{t=1}^T \epsilon_t \, \pi_t(\mathbf{w}_{1:t-1}(\epsilon)) \right].
\]

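For example, suppose $\mathcal{Z} = \{0,1\}$, the loss function is the indicator loss, and strategies potentially
depend on the full history. Then one can verify that

\[
\sup_{\mathbf{w}, \mathbf{z}} \mathbb{E}_{\epsilon} \sup_{\pi \in \Pi} \left[ \sum_{t=1}^T \epsilon_t \mathbf{1}\{\pi_t(\mathbf{w}_{1:t-1}(\epsilon)) \ne \mathbf{z}_t(\epsilon)\} \right]
= \sup_{\mathbf{w}, \mathbf{z}} \mathbb{E}_{\epsilon} \sup_{\pi \in \Pi} \left[ \sum_{t=1}^T \epsilon_t \left( \pi_t(\mathbf{w}_{1:t-1}(\epsilon)) (1 - 2\mathbf{z}_t(\epsilon)) + \mathbf{z}_t(\epsilon) \right) \right]
= \mathfrak{R}(\Pi). \tag{2}
\]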
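
The first equality in (2) rests on the pointwise identity $\mathbf{1}\{f \ne z\} = f(1 - 2z) + z$ for binary $f$ and $z$, and the same expression equals $|f - z|$ when $f \in [0,1]$ and $z \in \{0,1\}$. A quick numerical check of both identities (the helper names are ours, introduced only for this illustration):

```python
def indicator_loss(f, z):
    """Indicator loss 1{f != z}."""
    return 1 if f != z else 0


def linearized(f, z):
    """The expression f * (1 - 2z) + z, which is linear in the prediction f."""
    return f * (1 - 2 * z) + z


# Binary predictions and outcomes: the indicator loss matches the linear form.
binary_ok = all(
    indicator_loss(f, z) == linearized(f, z) for f in (0, 1) for z in (0, 1)
)

# Predictions on a grid in [0, 1], binary outcomes: the absolute loss matches too.
abs_ok = all(
    abs(abs(f / 10 - z) - linearized(f / 10, z)) < 1e-12
    for f in range(11)
    for z in (0, 1)
)
print(binary_ok, abs_ok)
```

Once the loss is linear in the prediction, the factor $(1 - 2\mathbf{z}_t(\epsilon)) \in \{\pm 1\}$ can be absorbed into the sign $\epsilon_t$, and the remaining term $\epsilon_t \mathbf{z}_t(\epsilon)$ has zero mean, which is what removes the outcome tree.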



The same result holds when $\mathcal{F} = [0, 1]$ and $\ell$ is the absolute loss. The process of "erasing the loss"
(or, contraction) extends quite nicely to problems of supervised learning. Let us state the second 
main result: 



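Theorem 2. Suppose the loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is convex and $L$-Lipschitz in the first argument,
and let $\mathcal{Y} = [-1, 1]$. Then

\[
\mathcal{V}_T^S(\Pi) \le 2L \sup_{\mathbf{x}, \mathbf{y}} \mathbb{E}_{\epsilon} \sup_{\pi \in \Pi} \left[ \sum_{t=1}^T \epsilon_t \, \pi_t(\mathbf{x}_{1:t}(\epsilon), \mathbf{y}_{1:t-1}(\epsilon)) \right],
\]

where $(\mathbf{x}_{1:t}(\epsilon), \mathbf{y}_{1:t-1}(\epsilon))$ naturally takes the place of $\mathbf{w}_{1:t-1}(\epsilon)$ in Theorem 1. Further, if $\mathcal{Y} = [-1,1]$
and $\ell(\hat{y}, y) = |\hat{y} - y|$, then

\[
\mathcal{V}_T^S(\Pi) \ge \sup_{\mathbf{x}} \mathbb{E}_{\epsilon} \sup_{\pi \in \Pi} \left[ \sum_{t=1}^T \epsilon_t \, \pi_t(\mathbf{x}_{1:t}(\epsilon), \epsilon_{1:t-1}) \right].
\]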



Let us present a few simple examples as a warm-up. 

Example 1 (History-independent strategies). Let $\pi^f \in \Pi$ be constant history-independent strategies
$\pi_1^f = \ldots = \pi_T^f = f \in \mathcal{F}$. Then (1) recovers the definition of sequential Rademacher complexity in
[11].

Example 2 (Static experts). For static experts, each strategy $\pi$ is a predetermined sequence of
outcomes, and we may therefore associate each $\pi$ with a vector in $\mathcal{Z}^T$. A direct consequence of
Theorem 2 for any convex $L$-Lipschitz loss is that

\[
\mathcal{V}(\Pi) \le 2L \, \mathbb{E}_{\epsilon} \sup_{\pi \in \Pi} \left[ \sum_{t=1}^T \epsilon_t \pi_t \right],
\]

which is simply the classical i.i.d. Rademacher averages. For the case of $\mathcal{F} = [0, 1]$, $\mathcal{Z} = \{0, 1\}$, and
the absolute loss, this is the result of [2].

Example 3 (Finite-order Markov strategies). Let $\Pi^k$ be a set of strategies that only depend on the
$k$ most recent outcomes to determine the next move. Theorem 1 implies that the value of the game
is upper bounded as

\[
\mathcal{V}(\Pi^k) \le 2 \sup_{\mathbf{w}, \mathbf{z}} \mathbb{E}_{\epsilon} \sup_{\pi \in \Pi^k} \left[ \sum_{t=1}^T \epsilon_t \, \ell(\pi_t(\mathbf{w}_{t-k}(\epsilon), \ldots, \mathbf{w}_{t-1}(\epsilon)), \mathbf{z}_t(\epsilon)) \right].
\]

Now, suppose that $\mathcal{Z}$ is a finite set, of cardinality $s$. Then there are effectively $s^{s^k}$ strategies $\pi$. The
bound on the sequential Rademacher complexity then scales as $\sqrt{2 s^k \log(s) T}$, recovering the result
of [5] (see [3, Cor. 8.2]).
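
The counting in Example 3 can be checked with a few lines of arithmetic. The sketch below assumes, as the count $s^{s^k}$ suggests, that a strategy is a time-invariant map from the $s^k$ possible length-$k$ contexts to one of $s$ decisions, and it plugs the count into the standard finite-class bound $\sqrt{2 T \log N}$; the function names are ours.

```python
import math


def markov_strategy_count(s, k):
    """Number of maps from length-k contexts over an alphabet of size s to s decisions."""
    return s ** (s ** k)


def finite_class_bound(s, k, T):
    """sqrt(2 * s^k * log(s) * T), i.e. sqrt(2 * T * log N) with N = s^(s^k)."""
    return math.sqrt(2 * (s ** k) * math.log(s) * T)


n = markov_strategy_count(2, 2)        # binary alphabet, look-back 2
bound = finite_class_bound(2, 2, 100)  # T = 100 rounds
print(n, bound)
```

The bound grows exponentially in $k$ through $s^k$, which is why a constant look-back is benign while $k$ growing with $T$ calls for the finer analysis of the later sections.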

In addition to providing an understanding of minimax regret against a set of strategies, sequential 
Rademacher complexity can serve as a starting point for algorithmic development. As shown in 
[9], any admissible relaxation can be used to define a succinct algorithm with a regret guarantee. 
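For the setting of this paper, this means the following. Let $\mathrm{Rel} : \mathcal{Z}^t \to \mathbb{R}$, for each $t$, be a collection
of functions satisfying two conditions:

\[
\forall t, \quad \inf_{q_t \in \mathcal{Q}} \sup_{z_t \in \mathcal{Z}} \left\{ \mathbb{E}_{f_t \sim q_t} \ell(f_t, z_t) + \mathrm{Rel}(z_{1:t}) \right\} \le \mathrm{Rel}(z_{1:t-1}),
\quad \text{and} \quad
- \inf_{\pi \in \Pi} \sum_{t=1}^T \ell(\pi_t(z_{1:t-1}), z_t) \le \mathrm{Rel}(z_{1:T}).
\]

Then we say that the relaxation is admissible. It is then easy to show that the regret of any algorithm
that ensures the above inequalities is bounded by $\mathrm{Rel}(\{\})$.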

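Theorem 3. The conditional sequential Rademacher complexity with respect to $\Pi$

\[
\mathfrak{R}(\ell, \Pi \mid z_1, \ldots, z_t) \triangleq \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \left[ 2 \sum_{s=t+1}^T \epsilon_s \, \ell(\pi_s((z_{1:t}, \mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)) - \sum_{s=1}^t \ell(\pi_s(z_{1:s-1}), z_s) \right]
\]

is admissible.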

Conditional sequential Rademacher complexity can therefore be used as a starting point for possibly 
deriving computationally attractive algorithms, as shown throughout the paper. 

We may now define covering numbers for the set $\Pi$ of strategies over the history trees. The
development is a straightforward modification of the notions we developed in [11], where we replace
"any tree $\mathbf{x}$" with a tree of histories $\mathbf{w}_{1:t-1}$.

Definition 2. A set $V$ of $\mathbb{R}$-valued trees is an $\alpha$-cover (with respect to $\ell_p$) of a set of strategies $\Pi$
on a $\mathcal{Z}$-valued history tree $\mathbf{w}$ if

\[
\forall \pi \in \Pi, \ \forall \epsilon \in \{\pm 1\}^T, \ \exists \mathbf{v} \in V \ \text{s.t.} \ \left( \frac{1}{T} \sum_{t=1}^T |\pi_t(\mathbf{w}_{1:t-1}(\epsilon)) - \mathbf{v}_t(\epsilon)|^p \right)^{1/p} \le \alpha. \tag{3}
\]

An $\alpha$-covering number $\mathcal{N}_p(\Pi, \mathbf{w}, \alpha)$ is the size of the smallest $\alpha$-cover.

For supervised learning, $(\mathbf{x}_{1:t}(\epsilon), \mathbf{y}_{1:t-1}(\epsilon))$ takes the place of $\mathbf{w}_{1:t-1}(\epsilon)$. Now, for any history tree $\mathbf{w}$,
sequential Rademacher averages of a class of $[-1, 1]$-valued strategies $\Pi$ satisfy

\[
\mathfrak{R}(\Pi, \mathbf{w}) \le \inf_{\alpha \ge 0} \left\{ \alpha T + \sqrt{2 \log \mathcal{N}_1(\Pi, \mathbf{w}, \alpha) \, T} \right\}
\]

and the Dudley entropy integral type bound also holds:

\[
\mathfrak{R}(\Pi, \mathbf{w}) \le \inf_{\alpha \ge 0} \left\{ 4 \alpha T + 12 \sqrt{T} \int_{\alpha}^{1} \sqrt{\log \mathcal{N}_2(\Pi, \mathbf{w}, \delta)} \, d\delta \right\}. \tag{4}
\]

In particular, this bound should be compared with Theorem 7 in [2], which employs a covering
number in terms of a pointwise metric between strategies that requires closeness for all histories
and all time steps. Moreover, the results of [2] for real-valued prediction require strategies to be
bounded away from $0$ and $1$ by $\delta > 0$, and this restriction spoils the rates.

In the rest of the paper, we show how the results of this section (a) yield proofs of existence of
regret-minimization strategies with certain rates and (b) guide the development of algorithms. For some
of these examples, standard methods (such as Exponential Weights) come close to providing an
optimal rate, while for others they fail miserably.



3 Competing with Autoregressive Strategies 

In this section, we consider strategies that depend linearly on the past outcomes. To this end, we 
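fix a set $\Theta \subset \mathbb{R}^k$, for some $k > 0$, and parametrize the set of strategies as

\[
\Pi_\Theta = \left\{ \pi^\theta : \pi_t^\theta(z_1, \ldots, z_{t-1}) = \sum_{i=0}^{k-1} \theta_{i+1} z_{t-k+i}, \ \theta = (\theta_1, \ldots, \theta_k) \in \Theta \right\}.
\]

For consistency of notation, we assume that the sequence of outcomes is padded with zeros for
$t \le 0$. First, as an example where known methods can recover the correct rate, we consider the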
case of a constant look-back of size k. We then extend the study to cases where neither the regret 
behavior nor the algorithm is known in the literature, to the best of our knowledge. 



3.1 Finite Look-Back 



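Suppose $\mathcal{Z} = \mathcal{F} \subset \mathbb{R}^d$ are $\ell_2$ unit balls, the loss is $\ell(f, z) = \langle f, z \rangle$, and $\Theta \subset \mathbb{R}^k$ is also a unit $\ell_2$ ball.
Denoting by $W_{(t-k:t-1)} = [\mathbf{w}_{t-k}(\epsilon), \ldots, \mathbf{w}_{t-1}(\epsilon)]$ a matrix with columns in $\mathcal{Z}$,

\[
\begin{aligned}
\mathfrak{R}(\ell, \Pi_\Theta)
&= \sup_{\mathbf{w}, \mathbf{z}} \mathbb{E}_{\epsilon} \sup_{\theta \in \Theta} \left[ \sum_{t=1}^T \epsilon_t \left\langle \pi^\theta(\mathbf{w}_{t-k:t-1}(\epsilon)), \mathbf{z}_t(\epsilon) \right\rangle \right] \\
&= \sup_{\mathbf{w}, \mathbf{z}} \mathbb{E}_{\epsilon} \sup_{\theta \in \Theta} \left[ \sum_{t=1}^T \epsilon_t \, \mathbf{z}_t(\epsilon)^\top W_{(t-k:t-1)} \cdot \theta \right] \\
&= \sup_{\mathbf{w}, \mathbf{z}} \mathbb{E}_{\epsilon} \left\| \sum_{t=1}^T \epsilon_t \, \mathbf{z}_t(\epsilon)^\top W_{(t-k:t-1)} \right\| \le \sqrt{kT}.
\end{aligned} \tag{5}
\]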



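In fact, this bound against all strategies parametrized by $\Theta$ is achieved by the gradient descent
(GD) method with the simple update $\theta_{t+1} = \mathrm{Proj}_\Theta(\theta_t - \eta [z_{t-k}, \ldots, z_{t-1}]^\top z_t)$, where $\mathrm{Proj}_\Theta$ is the
Euclidean projection onto the set $\Theta$. This can be seen by writing the loss as

\[
\langle [z_{t-k}, \ldots, z_{t-1}] \cdot \theta_t, z_t \rangle = \langle \theta_t, [z_{t-k}, \ldots, z_{t-1}]^\top z_t \rangle.
\]

The regret of GD, $\sum_{t=1}^T \langle \theta_t, [z_{t-k}, \ldots, z_{t-1}]^\top z_t \rangle - \inf_{\theta \in \Theta} \sum_{t=1}^T \langle \theta, [z_{t-k}, \ldots, z_{t-1}]^\top z_t \rangle$, is precisely regret
against strategies in $\Theta$, and analysis of GD yields the rate in (5).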
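
The GD update above can be sketched in a few lines of plain Python. This is an illustrative implementation under our own assumptions, not the authors' code: outcomes are lists of floats in the unit ball of $\mathbb{R}^d$, the history is padded with zeros, and the step size `eta` is left to the caller (the usual tuning for projected GD would be of order $1/\sqrt{kT}$).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_l2_ball(theta):
    """Euclidean projection onto the unit l2 ball."""
    norm = math.sqrt(dot(theta, theta))
    return theta if norm <= 1.0 else [x / norm for x in theta]


def gd_autoregressive(outcomes, k, eta):
    """Projected GD over theta; returns (cumulative linear loss, final theta)."""
    d = len(outcomes[0])
    padded = [[0.0] * d] * k + list(outcomes)  # zero padding for t <= 0
    theta = [0.0] * k
    total = 0.0
    for t in range(len(outcomes)):
        window = padded[t:t + k]               # z_{t-k}, ..., z_{t-1}
        z_t = padded[t + k]
        grad = [dot(w, z_t) for w in window]   # [z_{t-k}, ..., z_{t-1}]^T z_t
        total += dot(theta, grad)              # loss <theta_t, grad> of this round
        theta = project_l2_ball([th - eta * g for th, g in zip(theta, grad)])
    return total, theta


# Alternating scalar outcomes with look-back k = 1 (hypothetical data).
total, theta = gd_autoregressive([[1.0], [-1.0], [1.0], [-1.0]], k=1, eta=0.5)
print(total, theta)
```

On this alternating sequence the parameter moves to $\theta = 1$, that is, "predict the previous outcome", which makes the linear loss negative on every later round.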

3.2 Full Dependence on History



The situation becomes less obvious when $k = T$ and strategies depend on the full history. The
regret bound in (5) is vacuous, and the question is whether a better bound can be proved, under
some additional assumptions on $\Theta$. Can such a bound be achieved by GD?

For simplicity, consider the case of $\mathcal{F} = \mathcal{Z} = [-1,1]$, and assume that $\Theta = B_p(1) \subset \mathbb{R}^{T-1}$ is a unit $\ell_p$
ball, for some $p \ge 1$. Since $k = T$, it is easier to re-index the coordinates so that

\[
\pi_t^\theta(z_{1:t-1}) = \sum_{i=1}^{t-1} \theta_i z_i.
\]

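The sequential Rademacher complexity of the strategy class is

\[
\mathfrak{R}(\ell, \Pi_\Theta) = \sup_{\mathbf{w}, \mathbf{z}} \mathbb{E}_{\epsilon} \sup_{\theta \in \Theta} \left[ \sum_{t=1}^T \epsilon_t \, \pi_t^\theta(\mathbf{w}_{1:t-1}(\epsilon)) \cdot \mathbf{z}_t(\epsilon) \right].
\]

Rearranging the terms, the last expression is equal to

\[
\begin{aligned}
\sup_{\mathbf{w}, \mathbf{z}} \mathbb{E}_{\epsilon} \sup_{\theta \in \Theta} \left[ \sum_{t=1}^T \left( \sum_{i=1}^{t-1} \theta_i \mathbf{w}_i(\epsilon) \right) \epsilon_t \mathbf{z}_t(\epsilon) \right]
&= \sup_{\mathbf{w}, \mathbf{z}} \mathbb{E}_{\epsilon} \sup_{\theta \in \Theta} \left[ \sum_{t=1}^{T-1} \theta_t \mathbf{w}_t(\epsilon) \left( \sum_{i=t+1}^{T} \epsilon_i \mathbf{z}_i(\epsilon) \right) \right] \\
&\le \sup_{\mathbf{w}, \mathbf{z}} \mathbb{E}_{\epsilon} \left[ \| \mathbf{w}_{1:T-1}(\epsilon) \|_q \cdot \max_{1 \le t \le T} \left| \sum_{i=t}^{T} \epsilon_i \mathbf{z}_i(\epsilon) \right| \right],
\end{aligned}
\]

where $q$ is the Hölder conjugate of $p$. Observe that

\[
\begin{aligned}
\sup_{\mathbf{z}} \mathbb{E}_{\epsilon} \sup_{1 \le t \le T} \left| \sum_{i=t}^{T} \epsilon_i \mathbf{z}_i(\epsilon) \right|
&\le \sup_{\mathbf{z}} \mathbb{E}_{\epsilon} \left[ \left| \sum_{i=1}^{T} \epsilon_i \mathbf{z}_i(\epsilon) \right| + \sup_{1 \le t \le T} \left| \sum_{i=1}^{t-1} \epsilon_i \mathbf{z}_i(\epsilon) \right| \right] \\
&\le 2 \sup_{\mathbf{z}} \mathbb{E}_{\epsilon} \sup_{1 \le t \le T} \left| \sum_{i=1}^{t} \epsilon_i \mathbf{z}_i(\epsilon) \right|.
\end{aligned}
\]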

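Since $\{\epsilon_t \mathbf{z}_t(\epsilon) : t = 1, \ldots, T\}$ is a bounded martingale difference sequence, the last term is of the
order of $O(\sqrt{T})$. Now, suppose there is some $\beta \ge 0$ such that $\|\mathbf{w}_{1:T-1}(\epsilon)\|_q \le T^\beta$ for all $\epsilon$. This
assumption can be implemented if we consider constrained adversaries, where such an $\ell_q$-bound is
required to hold for any prefix $\mathbf{w}_{1:t}(\epsilon)$ of history (in the Appendix, we prove Theorem 1 for the case of
constrained sequences). Then $\mathfrak{R}(\ell, \Pi_\Theta) \le C \cdot T^{\beta + 1/2}$ for some constant $C$. We now compare the rate
of convergence of sequential Rademacher complexity and the rate of the mirror descent algorithm for different
settings of $q$ in Table 1. If $\|\theta\|_p \le 1$ and $\|\mathbf{w}\|_q \le T^\beta$ for $q \ge 2$, the convergence rate of mirror descent
with Legendre function $F(\theta) = \frac{1}{2} \|\theta\|_p^2$ is $\sqrt{q-1} \, T^{\beta + 1/2}$ (see [13]).

$\Theta$                      constraint on $\mathbf{w}_{1:T-1}$              sequential Radem. rate    Mirror descent rate
--------------------------------------------------------------------------------------------------------------------
$B_1(1)$                      $\|\mathbf{w}_{1:T-1}\|_\infty \le 1$           $\sqrt{T}$                $\sqrt{T \log T}$
$B_p(1)$, $q \ge 2$           $\|\mathbf{w}_{1:T-1}\|_q \le T^\beta$          $T^{\beta + 1/2}$         $\sqrt{q-1} \, T^{\beta + 1/2}$
$B_2(1)$                      $\|\mathbf{w}_{1:T-1}\|_2 \le T^\beta$          $T^{\beta + 1/2}$         $T^{\beta + 1/2}$
$B_p(1)$, $1 \le q \le 2$     $\|\mathbf{w}_{1:T-1}\|_q \le T^\beta$          $T^{\beta + 1/2}$         $T^{\beta + 1/q}$
$B_\infty(1)$                 $\|\mathbf{w}_{1:T-1}\|_1 \le T^\beta$          $T^{\beta + 1/2}$         $T$

Table 1: Comparison of the rates of convergence (up to constant factors)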

We observe that mirror descent, which is known to be optimal for online linear optimization, and
which gives the correct rate for the case of bounded look-back strategies, in several regimes fails
to yield the correct rate for more general linearly parametrized strategies. Even in the most basic
regime where $\Theta$ is a unit $\ell_1$ ball and the sequence of data is not constrained (other than $\mathcal{Z} = [-1, 1]$),
there is a gap of $\sqrt{\log T}$ between the Rademacher bound and the guarantee of mirror descent. Is
there an algorithm that removes this factor?

3.2.1 Algorithms for $\Theta = B_1(1)$



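For the example considered in the previous section, with $\mathcal{F} = \mathcal{Z} = [-1,1]$ and $\Theta = B_1(1)$, the
conditional sequential Rademacher complexity of Theorem 3 becomes

\[
\begin{aligned}
\mathfrak{R}_T(\Pi \mid z_1, \ldots, z_t)
&= \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \left[ 2 \sum_{s=t+1}^T \epsilon_s \, \pi_s(z_{1:t}, \mathbf{w}_{1:s-t-1}(\epsilon)) \cdot \mathbf{z}_{s-t}(\epsilon) - \sum_{s=1}^t z_s \pi_s(z_{1:s-1}) \right] \\
&\le \sup_{\mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \left[ 2 \sum_{s=t+1}^T \epsilon_s \, \pi_s(z_{1:t}, \mathbf{w}_{1:s-t-1}(\epsilon)) - \sum_{s=1}^t z_s \pi_s(z_{1:s-1}) \right],
\end{aligned}
\]

where the $\mathbf{z}$ tree is "erased", as at the end of the proof of Theorem 2. Define $a_s(\epsilon) = 2\epsilon_s$ for $s > t$
and $-z_s$ otherwise; $b_i(\epsilon) = \mathbf{w}_i(\epsilon)$ for $i > t$ and $z_i$ otherwise. We can then simply write the last expression as

\[
\sup_{\mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\theta \in \Theta} \left[ \sum_{s=1}^T a_s(\epsilon) \sum_{i=1}^{s-1} \theta_i b_i(\epsilon) \right]
= \sup_{\mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\theta \in \Theta} \left[ \sum_{i=1}^{T-1} \theta_i b_i(\epsilon) \sum_{s=i+1}^{T} a_s(\epsilon) \right]
\le \mathbb{E}_{\epsilon_{t+1:T}} \max_{1 \le s \le T} \left| \sum_{i=s}^{T} a_i(\epsilon) \right|,
\]

which we may use as a relaxation:

Lemma 4. Define $a_s^t(\epsilon) = 2\epsilon_s$ for $s > t$, and $-z_s$ otherwise. Then,

\[
\mathrm{Rel}(z_{1:t}) = \mathbb{E}_{\epsilon_{t+1:T}} \max_{1 \le s \le T} \left| \sum_{i=s}^{T} a_i^t(\epsilon) \right|
\]

is an admissible relaxation.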



With this relaxation, the following method attains $O(\sqrt{T})$ regret: prediction at step $t$ is

\[
q_t = \operatorname*{argmin}_{q \in [-1,1]} \ \sup_{z_t \in \{\pm 1\}} \left\{ \mathbb{E}_{f_t \sim q} f_t \cdot z_t + \mathbb{E}_{\epsilon_{t+1:T}} \max_{1 \le s \le T} \left| \sum_{i=s}^{T} a_i^t(\epsilon) \right| \right\},
\]

where the sup over $z_t \in [-1,1]$ is achieved at $\{\pm 1\}$ due to convexity. Following [9], we can also
derive randomized algorithms, which can be viewed as "randomized playout" generalizations of the
Follow the Perturbed Leader algorithm.

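Lemma 5. Consider the randomized strategy where at round $t$ we first draw $\epsilon_{t+1}, \ldots, \epsilon_T$ uniformly
at random and then further draw our move $f_t$ according to the distribution

\[
\begin{aligned}
q_t(\epsilon) &= \operatorname*{argmin}_{q \in [-1,1]} \ \sup_{z_t \in \{-1, 1\}} \left\{ \mathbb{E}_{f_t \sim q} f_t \cdot z_t + \max_{1 \le s \le T} \left| \sum_{i=s}^{T} a_i^t(\epsilon) \right| \right\} \\
&= \frac{1}{2} \left( \max\left\{ \max_{s=1,\ldots,t} \left| -\sum_{i=s}^{t-1} z_i + 1 + 2 \sum_{i=t+1}^{T} \epsilon_i \right|, \ \max_{s=t+1,\ldots,T} \left| 2 \sum_{i=s}^{T} \epsilon_i \right| \right\} \right. \\
&\qquad \left. - \max\left\{ \max_{s=1,\ldots,t} \left| -\sum_{i=s}^{t-1} z_i - 1 + 2 \sum_{i=t+1}^{T} \epsilon_i \right|, \ \max_{s=t+1,\ldots,T} \left| 2 \sum_{i=s}^{T} \epsilon_i \right| \right\} \right).
\end{aligned}
\]

The expected regret of this randomized strategy is upper bounded by sequential Rademacher
complexity: $\mathbb{E}[\mathrm{Reg}_T] \le 2 \mathfrak{R}_T(\Pi)$, which was shown to be $O(\sqrt{T})$ (see Table 1).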
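
The closed form in Lemma 5 translates directly into code. The sketch below is our own illustration: it returns the mean $q_t(\epsilon)$ of the move for one random playout, using the fact that with $a_t = -z_t$ the two candidate outcomes $z_t = \mp 1$ contribute $\pm 1$ inside the absolute values. Drawing $f_t$ from any distribution on $[-1,1]$ with this mean (or playing the mean itself, since the loss is linear) is left to the caller; the function and argument names are assumptions.

```python
import random


def playout_mean(past, T, rng):
    """Mean q_t(eps) of the move at round t = len(past) + 1, for F = Z = [-1, 1]."""
    t = len(past) + 1
    eps = [rng.choice((-1, 1)) for _ in range(T - t)]  # eps_{t+1}, ..., eps_T
    future = 2 * sum(eps)                              # 2 * sum_{i=t+1}^T eps_i

    # max over s = t+1..T of |2 * sum_{i=s}^T eps_i|, via suffix sums.
    tail, tail_max = 0, 0
    for e in reversed(eps):
        tail += 2 * e
        tail_max = max(tail_max, abs(tail))

    # -sum_{i=s}^{t-1} z_i for s = t, t-1, ..., 1 (the sum is empty when s = t).
    suffixes, acc = [0.0], 0.0
    for z in reversed(past):
        acc -= z
        suffixes.append(acc)

    def phi(sign):
        return max(max(abs(a + sign + future) for a in suffixes), tail_max)

    return 0.5 * (phi(+1) - phi(-1))


rng = random.Random(0)
q_first = playout_mean([], 1, rng)          # no history, no future signs
q_last = playout_mean([1.0, 1.0], 3, rng)   # last round after two +1 outcomes
print(q_first, q_last)
```

Each call costs $O(T)$ time because of the fresh draw of $T - t$ signs and the scan over suffix sums, which is the cost the Gaussian variant discussed next is designed to avoid.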

The time consuming parts of the above randomized method are to draw $T - t$ random bits at round
$t$ and to calculate the partial sums. However, we may replace Rademacher random variables by
Gaussian $\mathcal{N}(0, 1)$ random variables and use known results on the distributions of extrema of a
Brownian motion. To this end, define a Gaussian analogue of conditional sequential Rademacher
complexity

\[
\mathcal{G}_T(\Pi \mid z_1, \ldots, z_t) = \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \left[ \sqrt{2\pi} \sum_{s=t+1}^T \sigma_s \, \ell(\pi_s((z_{1:t}, \mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)) - \sum_{s=1}^t \ell(\pi_s(z_{1:s-1}), z_s) \right],
\]

where $\sigma_t \sim \mathcal{N}(0, 1)$, and $\epsilon = (\mathrm{sign}(\sigma_1), \ldots, \mathrm{sign}(\sigma_T))$ (under the square root, $\pi$ denotes the constant $3.14\ldots$, not a strategy). For our example the $O(\sqrt{T})$ bound can be
shown for $\mathcal{G}_T(\Pi)$ by calculating the expectation of the maximum of Brownian motion. Proofs similar
to Theorem 1 and Theorem 3 show that the conditional Gaussian complexity $\mathcal{G}_T(\Pi \mid z_1, \ldots, z_t)$ is
an upper bound on $\mathfrak{R}_T(\Pi \mid z_1, \ldots, z_t)$ and is admissible (see Theorem 9 in the Appendix). Furthermore,
the proof of Lemma 5 holds for Gaussian random variables, and gives the randomized algorithm
as in Lemma 5 with $\epsilon_t$ replaced by $\sigma_t$. It is not difficult to see that we can keep track of the
maximum and minimum of $\{-\sum_{i=s}^{t-1} z_i\}$ between rounds in $O(1)$ time. We can then draw three
random variables from the joint distribution of the maximum, the minimum and the endpoint of
a Brownian motion and calculate the prediction in $O(1)$ time per round of the game (the joint
distribution can be found in [7]). In conclusion, we have derived an algorithm for the case of
$\Theta = B_1(1)$, with time complexity of $O(1)$ per round and the optimal regret bound of $O(\sqrt{T})$. We
leave it as an open question to develop efficient and optimal algorithms for the other settings in
Table 1.



4 Competing with Statistical Models 

In this section we consider competing with a set of strategies that arise from statistical models. 
For example, for the case of Bayesian models, strategies are parametrized by the choice of a prior. 
Regret bounds with respect to a set of such methods can be thought of as a robustness statement: 
we are aiming to perform as well as the strategy with the best choice of a prior. We start this 
section with a general setup that needs further investigation. 



4.1 Compression and Sufficient Statistics 

Assume that strategies in $\Pi$ have a particular form: they all work with a "sufficient statistic",
or, more loosely, compression of the past data. Suppose "sufficient statistics" can take values in
some set $\Gamma$. Fix a set $\bar{\Pi}$ of mappings $\bar{\pi} : \Gamma \to \mathcal{F}$. We assume that all the strategies in $\Pi$ are of
the form $\pi_t(z_1, \ldots, z_{t-1}) = \bar{\pi}(\gamma(z_1, \ldots, z_{t-1}))$ for some $\bar{\pi} \in \bar{\Pi}$ and $\gamma : \mathcal{Z}^* \to \Gamma$. Such a bottleneck
$\Gamma$ can arise due to a finite memory or finite precision, but can also arise if the strategies in $\Pi$ are
actually solutions to a statistical problem. If we assume a certain stochastic source for the data, we
may estimate the parameters of the model, and there is often a natural set of sufficient statistics
associated with it. If we collect all such solutions to stochastic models in a set $\Pi$, we may compete
with all these strategies as long as $\Gamma$ is not too large and the dependence of estimators on these
sufficient statistics is smooth. With the notation introduced in this paper, we need to study the
sequential Rademacher complexity for strategies $\Pi$, which can be upper bounded by the complexity
of $\bar{\Pi}$ on $\Gamma$-valued trees:

\[
\mathfrak{R}(\Pi) \le \sup_{\mathbf{g}} \mathbb{E}_{\epsilon} \sup_{\bar{\pi} \in \bar{\Pi}} \left[ \sum_{t=1}^T \epsilon_t \, \bar{\pi}(\mathbf{g}_t(\epsilon)) \right].
\]

This complexity corresponds to our intuition that with sufficient statistics the dependence on the
ever-growing history can be replaced with the dependence on a summary of the data. Next, we
consider one particular case of this general idea, and refer to [6] for more details on these types of
bounds.



4.2 Bernoulli Model with a Beta Prior 



Suppose the data $z_t \in \{0, 1\}$ is generated according to a Bernoulli distribution with parameter $p$, and
the prior on $p \in [0, 1]$ is $p \sim \mathrm{Beta}(\alpha, \beta)$. Given the data $\{z_1, \ldots, z_{t-1}\}$, the maximum a posteriori
(MAP) estimator of $p$ is $\hat{p} = (\sum_{i=1}^{t-1} z_i + \alpha - 1)/(t - 1 + \alpha + \beta - 2)$. We now consider the problem
of competing with $\Pi = \{\pi^{\alpha,\beta} : \alpha > 1, \beta \in (1, C_\beta]\}$ for some $C_\beta$, where each $\pi^{\alpha,\beta}$ predicts the
corresponding MAP value for the next round:

\[
\pi_t^{\alpha,\beta}(z_1, \ldots, z_{t-1}) = \frac{\sum_{i=1}^{t-1} z_i + \alpha - 1}{t - 1 + \alpha + \beta - 2}.
\]
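
For concreteness, a strategy $\pi^{\alpha,\beta}$ can be written as a closure over the prior parameters. The sketch below simply transcribes the MAP formula; the function names and the sample data are ours, and the absolute loss is included to anticipate the loss used in the analysis that follows.

```python
def map_strategy(alpha, beta):
    """Strategy pi^{alpha,beta}: MAP estimate of p from the outcomes seen so far."""
    def predict(history):
        t = len(history) + 1
        return (sum(history) + alpha - 1) / (t - 1 + alpha + beta - 2)
    return predict


def cumulative_abs_loss(strategy, outcomes):
    """Sum over rounds of |pi_t(z_{1:t-1}) - z_t|."""
    return sum(
        abs(strategy(outcomes[:t]) - outcomes[t]) for t in range(len(outcomes))
    )


pi = map_strategy(alpha=2.0, beta=2.0)
first = pi([])            # (0 + 1) / (0 + 2)
later = pi([1, 1, 0])     # (2 + 1) / (3 + 2)
total = cumulative_abs_loss(pi, [1, 1, 0, 1])
print(first, later, total)
```

The history enters only through the running sum and the round index, which is the sufficient-statistic structure described in Section 4.1.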

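Let us consider the absolute loss, which is equivalent to the probability of a mistake of the randomized
prediction¹ with bias $\pi_t^{\alpha,\beta}$. Thus, the loss of a strategy $\pi^{\alpha,\beta}$ on round $t$ is $|\pi_t^{\alpha,\beta}(z_{1:t-1}) - z_t|$. Using
Theorem 1 and the argument in (2) to erase the outcome tree, we conclude that there exists a
regret minimization algorithm against the set $\Pi$ which attains regret of at most

\[
2 \sup_{\mathbf{w}} \mathbb{E}_{\epsilon} \sup_{\alpha, \beta} \left[ \sum_{t=1}^T \epsilon_t \, \frac{\sum_{i=1}^{t-1} \mathbf{w}_i(\epsilon) + \alpha - 1}{t - 1 + \alpha + \beta - 2} \right].
\]

To analyze the rate exhibited by this upper bound, construct a new tree $\mathbf{g}$ with $\mathbf{g}_1(\epsilon) = 1$ and
$\mathbf{g}_t(\epsilon) = \frac{\sum_{i=1}^{t-1} \mathbf{w}_i(\epsilon) + \alpha - 1}{t + \alpha - 2} \in [0, 1]$ for $t \ge 2$. With this notation, we can simply re-write the last expression
as twice

\[
\sup_{\mathbf{g}} \mathbb{E}_{\epsilon} \sup_{\alpha, \beta} \left[ \sum_{t=1}^T \epsilon_t \, \mathbf{g}_t(\epsilon) \, \frac{t + \alpha - 2}{t + \alpha + \beta - 3} \right].
\]

The supremum ranges over all $[0, 1]$-valued trees $\mathbf{g}$, but we can pass to the supremum over all $[-1, 1]$-valued
trees (thus making the value larger). We then observe that the supremum is achieved at
a $\{\pm 1\}$-valued tree $\mathbf{g}$, which can then be erased as in the end of the proof of Theorem 2 (roughly
speaking, it amounts to renaming $\epsilon_t$ into $\epsilon_t \mathbf{g}_t(\epsilon_{1:t-1})$). We obtain an upper bound

\[
\mathfrak{R}(\Pi) \le \mathbb{E}_{\epsilon} \sup_{\alpha, \beta} \left[ \sum_{t=1}^T \epsilon_t \, \frac{t + \alpha - 2}{t + \alpha + \beta - 3} \right]
\le \mathbb{E}_{\epsilon} \left| \sum_{t=1}^T \epsilon_t \right| + \mathbb{E}_{\epsilon} \sup_{\alpha, \beta} \left| \sum_{t=1}^T \frac{\epsilon_t (\beta - 1)}{t + \alpha + \beta - 3} \right|
\le (\sqrt{C_\beta} + 1) \sqrt{T}, \tag{6}
\]

where we used the Cauchy-Schwarz inequality for the second term. We note that an experts algorithm
would require a discretization that depends on $T$ and will yield a regret bound of order $O(\sqrt{T \log T})$.
It is therefore interesting to find an algorithm that avoids the discretization and obtains this regret.
To this end, we take the derived upper bound on the sequential Rademacher complexity and prove
that it is an admissible relaxation.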



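Lemma 6. The relaxation

\[
\mathrm{Rel}(z_{1:t}) = \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\alpha, \beta} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \, \frac{s + \alpha - 2}{s + \alpha + \beta - 3} - \sum_{s=1}^{t} \left| \pi_s^{\alpha,\beta}(z_{1:s-1}) - z_s \right| \right]
\]

is admissible.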



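¹ Alternatively, we can consider strategies that predict according to $\mathbf{1}\{\hat{p} > 1/2\}$, which better matches the choice of
an absolute loss. However, in this situation, an experts algorithm on an appropriate discretization attains the bound.

Given that this relaxation is admissible, we have a guarantee that the following algorithm attains
the rate $(\sqrt{C_\beta} + 1)\sqrt{T}$ given in (6):

\[
q_t = \operatorname*{argmin}_{q \in [0,1]} \ \max_{z_t \in \{0,1\}} \left\{ \mathbb{E}_{f \sim q} |f - z_t| + \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\alpha, \beta} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \, \frac{s + \alpha - 2}{s + \alpha + \beta - 3} - \sum_{s=1}^{t} \left| \pi_s^{\alpha,\beta}(z_{1:s-1}) - z_s \right| \right] \right\}.
\]

In fact, $q_t$ can be written as

\[
\begin{aligned}
q_t = \frac{1}{2} \Bigg\{ & \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\alpha, \beta} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \, \frac{s + \alpha - 2}{s + \alpha + \beta - 3} - \sum_{s=1}^{t-1} (1 - 2 z_s) \, \pi_s^{\alpha,\beta}(z_{1:s-1}) + \pi_t^{\alpha,\beta}(z_{1:t-1}) \right] \\
& - \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\alpha, \beta} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \, \frac{s + \alpha - 2}{s + \alpha + \beta - 3} - \sum_{s=1}^{t-1} (1 - 2 z_s) \, \pi_s^{\alpha,\beta}(z_{1:s-1}) - \pi_t^{\alpha,\beta}(z_{1:t-1}) \right] \Bigg\},
\end{aligned}
\]

where $\pi_s^{\alpha,\beta}(z_{1:s-1}) = (\sum_{i=1}^{s-1} z_i + \alpha - 1)/(s + \alpha + \beta - 3)$.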



For a given realization of random signs, the supremum is an optimization of a sum of linear fractional
functions of two variables. Such an optimization can be carried out in time $O(T \log T)$ (see [4]).
To deal with the expectation over random signs, one may either average over many realizations or
use the random playout idea and only draw one sequence. Such an algorithm is admissible for the
above relaxation, obtains the $O(\sqrt{T})$ bound, and runs in $O(T \log T)$ time per step. We leave it as
an open problem whether a more efficient algorithm with $O(\sqrt{T})$ regret exists.



5 Competing with Regularized Least Squares 

Consider the supervised learning problem with $\mathcal{Y} = [-1, 1]$ and some set $\mathcal{X}$. Consider the Regularized
Least Squares (RLS) strategies, parametrized by a regularization parameter $\lambda$ and a shift $w_0$.
That is, given data $(x_1, y_1), \ldots, (x_t, y_t)$, the strategy solves

\[
\operatorname*{argmin}_{w} \ \sum_{i=1}^{t} (y_i - \langle x_i, w \rangle)^2 + \lambda \|w - w_0\|^2.
\]

For a given pair $\lambda$ and $w_0$, the solution is

\[
w_{t+1}^{\lambda, w_0} = w_0 + (X^\top X + \lambda I)^{-1} X^\top (Y - X w_0),
\]

where $X \in \mathbb{R}^{t \times d}$ and $Y \in \mathbb{R}^{t \times 1}$ are the usual matrix representations of the data $x_{1:t}, y_{1:t}$. We would
like to compete against a set of such RLS strategies which make prediction $\langle w_t^{\lambda, w_0}, x_t \rangle$, given side
information $x_t$. Since the outcomes are in $[-1, 1]$, without loss of generality we clip the predictions
of strategies to this interval, thus making our regret minimization goal only harder. To this end,
let $c(a) = a$ if $a \in [-1,1]$ and $c(a) = \mathrm{sign}(a)$ for $|a| > 1$. Thus, given side-information $x_t \in \mathcal{X}$, the
prediction of strategies in $\Pi = \{\pi^{\lambda, w_0} : \lambda \ge \lambda_{\min} > 0, \|w_0\|_2 \le 1\}$ is simply the clipped product

\[
\pi_t^{\lambda, w_0}(x_{1:t}, y_{1:t-1}) = c\left( \langle w_t^{\lambda, w_0}, x_t \rangle \right).
\]

Let us take the squared loss function $\ell(\hat{y}, y) = (\hat{y} - y)^2$.
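
A one-dimensional sketch makes the shifted RLS strategy easy to check by hand. Setting the derivative of $\sum_i (y_i - x_i w)^2 + \lambda (w - w_0)^2$ to zero gives $w = w_0 + \sum_i x_i (y_i - x_i w_0) / (\sum_i x_i^2 + \lambda)$, which is what the code below computes; restricting to scalar $x_i$ and the function names are our simplifying assumptions.

```python
def rls_weight(xs, ys, lam, w0):
    """Minimizer of sum_i (y_i - x_i * w)^2 + lam * (w - w0)^2 for scalar x_i."""
    num = sum(x * (y - x * w0) for x, y in zip(xs, ys))
    den = sum(x * x for x in xs) + lam
    return w0 + num / den


def clip(a):
    """The clipping map c(a): identity on [-1, 1], sign(a) outside."""
    return max(-1.0, min(1.0, a))


def rls_predict(xs_past, ys_past, x_new, lam, w0):
    """Clipped prediction c(w * x_new) using the weight fitted on past data."""
    return clip(rls_weight(xs_past, ys_past, lam, w0) * x_new)


xs, ys, lam, w0 = [1.0, 2.0], [1.0, -1.0], 0.5, 0.3
w = rls_weight(xs, ys, lam, w0)
# Stationarity check: sum_i x_i (y_i - x_i w) - lam * (w - w0) should vanish.
residual = sum(x * (y - x * w) for x, y in zip(xs, ys)) - lam * (w - w0)
print(w, residual, rls_predict(xs, ys, 10.0, lam, w0))
```

As $\lambda \to \infty$ the fitted weight tends to the shift $w_0$, so the strategy becomes the static linear predictor $x \mapsto c(w_0 x)$.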

Lemma 7. For the set $\Pi$ of strategies defined above, the minimax regret of competing against
Regularized Least Squares strategies is

\[
\mathcal{V}_T(\Pi) \le c \sqrt{T \log(T \lambda_{\min}^{-1})}
\]

for an absolute constant $c$.

Observe that $\lambda_{\min}^{-1}$ enters only logarithmically, which allows us to set, for instance, $\lambda_{\min} = 1/T$.
Finally, we mention that the set of strategies includes $\lambda = \infty$. This setting corresponds to a
static strategy $\pi_t^{\infty, w_0}(x_{1:t}, y_{1:t-1}) = \langle w_0, x_t \rangle$, and regret against such a static family parametrized by
$w_0 \in B_2(1)$ is exactly the objective of online linear regression [14]. Lemma 7 thus shows that it is
possible to have vanishing regret with respect to a much larger set of strategies. It is an interesting
open question whether one can develop an efficient algorithm with the above regret guarantee.



6 Competing with Follow the Regularized Leader Strategies 

Consider the problem of online linear optimization with the loss function $\ell(f_t,z_t) = \langle f_t,z_t\rangle$ for
$f_t \in \mathcal{F}$, $z_t \in \mathcal{Z}$. For simplicity, assume that $\mathcal{F} = \mathcal{Z} = B_2(1)$. An algorithm commonly used
for online linear and online convex optimization problems is the Follow the Regularized Leader
(FTRL) algorithm. We now consider competing with a family of FTRL algorithms $\pi^{w_0,\lambda}$ indexed
by $w_0 \in \{w : \|w\| \le 1\}$ and $\lambda \in \Lambda$, where $\Lambda$ is a family of functions $\lambda : \mathbb{R}^{+} \times [T] \to \mathbb{R}^{+}$ specifying a
schedule for the choice of regularization parameters. Specifically, we consider strategies $\pi^{w_0,\lambda}$ such
that $\pi_t^{w_0,\lambda}(z_1,\ldots,z_{t-1}) = w_t$, where

$$w_t = w_0 + \operatorname*{argmin}_{w : \|w\| \le 1} \left\{ \sum_{i=1}^{t-1} \langle w, z_i\rangle + \frac{1}{2}\,\lambda\Big(\Big\|\sum_{i=1}^{t-1} z_i\Big\|, t\Big) \|w\|^2 \right\} \qquad (7)$$

This can be written in closed form as $w_t = w_0 - \big(\sum_{i=1}^{t-1} z_i\big) \big/ \max\big\{\lambda\big(\big\|\sum_{i=1}^{t-1} z_i\big\|, t\big), \big\|\sum_{i=1}^{t-1} z_i\big\|\big\}$.
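The closed form above can be sketched in a few lines of code. The function name `ftrl_prediction`, the list-of-floats vector representation, and the convention of returning $w_0$ when the denominator is zero are assumptions of this example, not part of the paper.

```python
import math

def ftrl_prediction(w0, zs, lam, t):
    """Closed-form FTRL prediction w_t = w0 - S / max(lam(||S||, t), ||S||),
    where S = z_1 + ... + z_{t-1} and lam is a schedule (norm, round) -> R+."""
    d = len(w0)
    s = [sum(z[j] for z in zs) for j in range(d)]   # S = sum of past outcomes
    norm_s = math.sqrt(sum(v * v for v in s))       # ||S||
    denom = max(lam(norm_s, t), norm_s)
    if denom == 0.0:
        return list(w0)  # assumed convention: no past data and zero regularization
    return [w0[j] - s[j] / denom for j in range(d)]
```

When $\lambda(\|S\|, t) \ge \|S\|$ the step is the unconstrained minimizer $-S/\lambda$; otherwise the step is the projection $-S/\|S\|$ onto the unit ball, which is what taking the maximum in the denominator encodes.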

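Lemma 8. For a given class $\Lambda$ of functions indicating choices of the regularization parameters,
define a class $\Gamma$ of functions on $[0,1] \times [1/T, 1]$ specified by

$$\Gamma = \left\{\gamma : \forall b \in [1/T, 1],\ a \in [0,1],\ \gamma(a,b) = \min\left\{\frac{a/(b^{-1}-1)}{\lambda\big(a/(b^{-1}-1),\, 1/b\big)},\ 1\right\},\ \lambda \in \Lambda\right\}$$

Then the value of the online learning game competing against FTRL strategies given by Equation
(7) is bounded as

$$\mathcal{V}_T(\Pi_\Lambda) \le 4\sqrt{T} + 2\,\mathcal{R}_T(\Gamma)$$

where $\mathcal{R}_T(\Gamma)$ is the sequential Rademacher complexity [11] of $\Gamma$.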

Notice that if $|\Lambda| < \infty$ then the second term is bounded as $\mathcal{R}_T(\Gamma) \le \sqrt{T \log |\Lambda|}$. However, we may
compete with an infinite set of step-size rules. Indeed, each $\gamma \in \Gamma$ is a function $[0,1]^2 \to [0,1]$.
Hence, even if one considers $\Gamma$ to be the set of all 1-Lipschitz functions (Lipschitz w.r.t., say, the $\ell_\infty$
norm), it holds that $\mathcal{R}_T(\Gamma) \le 2\sqrt{T} \log T$. We conclude that it is possible to compete with the set of
FTRL strategies that pick any $w_0$ in the unit ball as a starting point and further use for the regularization
parameter schedule any $\lambda : \mathbb{R}^2 \to \mathbb{R}$ such that $\lambda\big(a/(b^{-1}-1),\, 1/b\big)$ is a 1-Lipschitz function for every
$a, b \in [1/T, 1]$.

Beyond the finite and Lipschitz cases shown above, it would be interesting to analyze richer families 
of step size schedules, and possibly derive efficient algorithms. 
A Proofs



Proof of Theorem 1. Let us prove a more general version of Theorem 1, which we do not state
in the main text due to lack of space. The extra twist is that we allow constraints on the sequences
$z_1,\ldots,z_T$ played by the adversary. Specifically, the adversary at round $t$ can only play $z_t$ that
satisfies the constraint $C_t(z_1,\ldots,z_t) = 1$, where $(C_1,\ldots,C_T)$ is a predetermined sequence of constraints
with $C_t : \mathcal{Z}^t \to \{0,1\}$. When each $C_t$ is the function that is always 1, we are in the setting of the
theorem statement, where we play against an unconstrained (worst-case) adversary. However, the proof here
also allows us to analyze constrained adversaries, which come in handy in many cases. Following
[12], a restriction $\mathcal{P}_{1:T}$ on the adversary is a sequence $\mathcal{P}_1,\ldots,\mathcal{P}_T$ of mappings $\mathcal{P}_t : \mathcal{Z}^{t-1} \to 2^{\mathcal{P}}$
such that $\mathcal{P}_t(z_{1:t-1})$ is a convex subset of $\mathcal{P}$ for any $z_{1:t-1} \in \mathcal{Z}^{t-1}$. In the present proof we will
only consider constrained adversaries, where $\mathcal{P}_t = \Delta(\mathcal{C}_t(z_{1:t-1}))$ is the set of all distributions on the
constrained subset

$$\mathcal{C}_t(z_{1:t-1}) = \{z \in \mathcal{Z} : C_t(z_1,\ldots,z_{t-1},z) = 1\}$$

defined at time $t$ via a binary constraint $C_t : \mathcal{Z}^t \to \{0,1\}$. Notice that the set $\mathcal{C}_t(z_{1:t-1})$ is the subset
of $\mathcal{Z}$ from which the adversary is allowed to pick the instance $z_t$ given the history so far. It was
shown in [12] that such constraints can model sequences with certain properties, such as slowly
changing sequences, low-variance sequences, and so on. Let $\mathcal{C}$ be the set of $\mathcal{Z}$-valued trees $\mathbf{z}$ such
that for every $\epsilon \in \{\pm 1\}^T$ and $t \in [T]$,

$$C_t(\mathbf{z}_1(\epsilon),\ldots,\mathbf{z}_t(\epsilon)) = 1,$$

that is, the set of trees such that the constraint is satisfied along any path. The statement we
now prove is that the value of the prediction problem with respect to a set $\Pi$ of strategies and
against constrained adversaries (denoted by $\mathcal{V}_T(\Pi, C_{1:T})$) is upper bounded by twice the sequential
complexity

$$\sup_{\mathbf{w} \in \mathcal{C},\, \mathbf{z}} \mathbb{E}_\epsilon \sup_{\pi \in \Pi} \sum_{t=1}^{T} \epsilon_t\, \ell\big(\pi_t(\mathbf{w}_1(\epsilon),\ldots,\mathbf{w}_{t-1}(\epsilon)),\, \mathbf{z}_t(\epsilon)\big) \qquad (8)$$

where it is crucial that the $\mathbf{w}$ tree ranges over trees that respect the constraints along all paths,
while $\mathbf{z}$ is allowed to be an arbitrary $\mathcal{Z}$-valued tree. The fact that $\mathbf{w}$ respects the constraints is the
only difference from the original statement of Theorem 1 in the main body of the paper.

For ease of notation we use $\langle\!\langle \cdot \rangle\!\rangle$ to denote repeated application of operators such as sup or inf.
For instance, $\langle\!\langle \sup_{a_t \in A} \inf_{b_t \in B} \mathbb{E}_{r_t \sim P} \rangle\!\rangle_{t=1}^{T} [F(a_1,b_1,r_1,\ldots,a_T,b_T,r_T)]$ denotes

$$\sup_{a_1 \in A} \inf_{b_1 \in B} \mathbb{E}_{r_1 \sim P} \cdots \sup_{a_T \in A} \inf_{b_T \in B} \mathbb{E}_{r_T \sim P}[F(a_1,b_1,r_1,\ldots,a_T,b_T,r_T)]$$
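To make the repeated-operator notation concrete, the following sketch evaluates such a nested expression by recursion over rounds. It assumes finite sets `A`, `B`, `R` and a uniform distribution $P$ on `R`; the function name `nested_value` and these finiteness assumptions are ours, introduced only for illustration.

```python
def nested_value(F, A, B, R, T, history=()):
    """Evaluate the T-fold interleaving of sup over a_t in A, inf over b_t in B,
    and expectation over r_t uniform on R, applied to F(history), where history
    is the tuple ((a_1, b_1, r_1), ..., (a_T, b_T, r_T))."""
    if len(history) == T:
        return F(history)
    best = float("-inf")
    for a in A:                      # outer sup over a_t
        worst = float("inf")
        for b in B:                  # inf over b_t
            # expectation over r_t, then recurse into round t + 1
            avg = sum(
                nested_value(F, A, B, R, T, history + ((a, b, r),)) for r in R
            ) / len(R)
            worst = min(worst, avg)
        best = max(best, worst)
    return best
```

The recursion makes explicit that the operators at round $t$ act on everything to their right, so each choice may depend on the full history of earlier rounds.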
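The value of a prediction problem with respect to a set of strategies and against constrained
adversaries can be written as:

$$\begin{aligned}
\mathcal{V}_T(\Pi, C_{1:T}) &= \Big\langle\!\!\Big\langle \inf_{q_t \in \Delta(\mathcal{F})} \sup_{p_t \in \mathcal{P}_t(z_{1:t-1})} \mathbb{E}_{f_t \sim q_t,\, z_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \left[\sum_{t=1}^{T} \ell(f_t,z_t) - \inf_{\pi \in \Pi} \sum_{t=1}^{T} \ell(\pi_t(z_{1:t-1}), z_t)\right] \\
&= \Big\langle\!\!\Big\langle \sup_{p_t \in \mathcal{P}_t(z_{1:t-1})} \mathbb{E}_{z_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} \mathbb{E}_{z'_t \sim p_t} \ell(f_t,z'_t) - \ell(\pi_t(z_{1:t-1}), z_t)\right] \\
&\le \Big\langle\!\!\Big\langle \sup_{p_t \in \mathcal{P}_t(z_{1:t-1})} \mathbb{E}_{z_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \mathbb{E}_{z'_t \sim p_t} \ell(\pi_t(z_{1:t-1}),z'_t) - \ell(\pi_t(z_{1:t-1}), z_t)\right] \\
&\le \Big\langle\!\!\Big\langle \sup_{p_t \in \mathcal{P}_t(z_{1:t-1})} \mathbb{E}_{z_t, z'_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \ell(\pi_t(z_{1:t-1}),z'_t) - \ell(\pi_t(z_{1:t-1}), z_t)\right]
\end{aligned}$$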

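Let us now define the "selector function" $\chi : \mathcal{Z} \times \mathcal{Z} \times \{\pm 1\} \to \mathcal{Z}$ by

$$\chi(z,z',\epsilon) = \begin{cases} z' & \text{if } \epsilon = -1 \\ z & \text{if } \epsilon = 1 \end{cases}$$

In other words, $\chi_t$ selects between $z_t$ and $z'_t$ depending on the sign of $\epsilon$. We will use the shorthand
$\chi_t(\epsilon_t) = \chi(z_t,z'_t,\epsilon_t)$ and $\chi_{1:t}(\epsilon_{1:t}) = (\chi(z_1,z'_1,\epsilon_1),\ldots,\chi(z_t,z'_t,\epsilon_t))$. We can then re-write the last
statement as

$$\Big\langle\!\!\Big\langle \sup_{p_t \in \mathcal{P}_t(\chi_{1:t-1}(\epsilon_{1:t-1}))} \mathbb{E}_{z_t,z'_t \sim p_t} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \Big(\ell\big(\pi_t(\chi_{1:t-1}(\epsilon_{1:t-1})), \chi_t(-\epsilon_t)\big) - \ell\big(\pi_t(\chi_{1:t-1}(\epsilon_{1:t-1})), \chi_t(\epsilon_t)\big)\Big)\right]$$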

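One can indeed verify that we simply used $\chi_t$ to switch between $z_t$ and $z'_t$ according to $\epsilon_t$. Now,
we can replace the second argument of the loss in both terms by a larger value to obtain an upper
bound

$$\begin{aligned}
&\Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{z_t,z'_t \sim p_t} \sup_{z''_t, z'''_t} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \epsilon_t \Big(\ell\big(\pi_t(\chi_{1:t-1}(\epsilon_{1:t-1})), z''_t\big) - \ell\big(\pi_t(\chi_{1:t-1}(\epsilon_{1:t-1})), z'''_t\big)\Big)\right] \\
&\le 2 \Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{z_t,z'_t \sim p_t} \sup_{z''_t} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \epsilon_t\, \ell\big(\pi_t(\chi_{1:t-1}(\epsilon_{1:t-1})), z''_t\big)\right]
\end{aligned}$$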

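since the two terms obtained by splitting the suprema are the same. We now pass to the suprema
over $z_t, z'_t$, noting that the constraints need to hold:

$$\begin{aligned}
&2 \Big\langle\!\!\Big\langle \sup_{z_t,z'_t \in \mathcal{C}_t(\chi_{1:t-1}(\epsilon_{1:t-1}))} \sup_{z''_t} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \epsilon_t\, \ell\big(\pi_t(\chi_{1:t-1}(\epsilon_{1:t-1})), z''_t\big)\right] \\
&= 2 \sup_{(\mathbf{z},\mathbf{z}') \in \mathcal{C}'} \sup_{\mathbf{z}''} \mathbb{E}_{\epsilon} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \epsilon_t\, \ell\big(\pi_t(\chi(\mathbf{z}_1,\mathbf{z}'_1,\epsilon_1),\ldots,\chi(\mathbf{z}_{t-1}(\epsilon),\mathbf{z}'_{t-1}(\epsilon),\epsilon_{t-1})),\, \mathbf{z}''_t(\epsilon)\big)\right] = (*)
\end{aligned}$$

where in the last step we passed to the tree notation. Importantly, the pair $(\mathbf{z},\mathbf{z}')$ of trees does not
range over all pairs, but only over those which satisfy the constraints:

$$\mathcal{C}' = \Big\{(\mathbf{z},\mathbf{z}') : \forall \epsilon \in \{\pm 1\}^T,\ \forall t \in [T],\ \mathbf{z}_t(\epsilon), \mathbf{z}'_t(\epsilon) \in \mathcal{C}_t\big(\chi(\mathbf{z}_1,\mathbf{z}'_1,\epsilon_1),\ldots,\chi(\mathbf{z}_{t-1}(\epsilon),\mathbf{z}'_{t-1}(\epsilon),\epsilon_{t-1})\big)\Big\}$$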
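Now, given the pair $(\mathbf{z},\mathbf{z}') \in \mathcal{C}'$, define a $\mathcal{Z}$-valued tree of depth $T$ as:

$$\mathbf{w}_1 = 0, \qquad \mathbf{w}_t(\epsilon) = \chi(\mathbf{z}_{t-1}(\epsilon),\mathbf{z}'_{t-1}(\epsilon),\epsilon_{t-1}) \quad \text{for all } t > 1$$

Clearly, this is a well-defined tree, and we now claim that it satisfies the constraints along every
path. Indeed, we need to check that for any $\epsilon$ and $t$, both $\mathbf{w}_t(\epsilon_{1:t-2}, +1), \mathbf{w}_t(\epsilon_{1:t-2}, -1) \in
\mathcal{C}_t(\mathbf{w}_1, \ldots, \mathbf{w}_{t-1}(\epsilon_{1:t-2}))$. This amounts to checking, by definition of $\mathbf{w}$ and the selector $\chi$, that

$$\mathbf{z}_{t-1}(\epsilon_{1:t-2}),\ \mathbf{z}'_{t-1}(\epsilon_{1:t-2}) \in \mathcal{C}_{t-1}\big(\chi(\mathbf{z}_1,\mathbf{z}'_1,\epsilon_1), \ldots, \chi(\mathbf{z}_{t-2}(\epsilon), \mathbf{z}'_{t-2}(\epsilon), \epsilon_{t-2})\big).$$

But this is true because $(\mathbf{z},\mathbf{z}') \in \mathcal{C}'$. Hence, $\mathbf{w}$ constructed from $\mathbf{z},\mathbf{z}'$ satisfies the constraints along
every path.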
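The construction of $\mathbf{w}$ from the pair $(\mathbf{z},\mathbf{z}')$ can be sketched as follows. Representing a tree as a callable `(t, eps) -> label` whose value depends only on the first $t-1$ signs is an assumption of this example, as are the names `chi` and `build_w`.

```python
def chi(z, z_prime, eps):
    """Selector: returns z if eps == +1 and z_prime if eps == -1."""
    return z if eps == 1 else z_prime

def build_w(z_tree, zp_tree):
    """Given trees z and z' (callables (t, eps) -> label, with t = 1..T and the
    label depending only on eps[:t-1]), return the tree w defined by
    w_1 = 0 and w_t(eps) = chi(z_{t-1}(eps), z'_{t-1}(eps), eps_{t-1}) for t > 1."""
    def w_tree(t, eps):
        if t == 1:
            return 0
        # eps is 0-indexed in code, so eps_{t-1} is eps[t - 2]
        return chi(z_tree(t - 1, eps), zp_tree(t - 1, eps), eps[t - 2])
    return w_tree
```

Since $\mathbf{z}_{t-1}$ and $\mathbf{z}'_{t-1}$ depend on $\epsilon_{1:t-2}$ and the selector adds a dependence on $\epsilon_{t-1}$, the label $\mathbf{w}_t$ depends only on $\epsilon_{1:t-1}$, as required of a tree.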

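We can therefore upper bound the expression in (*) by twice

$$\sup_{\mathbf{w} \in \mathcal{C}} \sup_{\mathbf{z}''} \mathbb{E}_{\epsilon} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \epsilon_t\, \ell\big(\pi_t(\mathbf{w}_1(\epsilon), \ldots, \mathbf{w}_{t-1}(\epsilon)),\, \mathbf{z}''_t(\epsilon)\big)\right]$$

Defining $\mathbf{w}^{*} = \mathbf{w}(-1)$ and $\mathbf{w}^{**} = \mathbf{w}(+1)$, we can expand the expectation with respect to $\epsilon_1$ of the
above expression as

$$\begin{aligned}
&\frac{1}{2} \sup_{\mathbf{w}^{*} \in \mathcal{C}} \sup_{\mathbf{z}''} \mathbb{E}_{\epsilon_{2:T}} \sup_{\pi \in \Pi} \left[-\ell(\pi_1(\cdot),\mathbf{z}''_1(\cdot)) + \sum_{t=2}^{T} \epsilon_t\, \ell\big(\pi_t(\mathbf{w}^{*}(\epsilon)),\, \mathbf{z}''_t(\epsilon)\big)\right] \\
&\quad + \frac{1}{2} \sup_{\mathbf{w}^{**} \in \mathcal{C}} \sup_{\mathbf{z}''} \mathbb{E}_{\epsilon_{2:T}} \sup_{\pi \in \Pi} \left[\ell(\pi_1(\cdot),\mathbf{z}''_1(\cdot)) + \sum_{t=2}^{T} \epsilon_t\, \ell\big(\pi_t(\mathbf{w}^{**}(\epsilon)),\, \mathbf{z}''_t(\epsilon)\big)\right]
\end{aligned}$$

With the assumption that we do not suffer loss at the first round, which means $\ell(\pi_1(\cdot), \mathbf{z}''_1(\cdot)) = 0$,
we can see that both terms achieve the suprema with the same $\mathbf{w}^{*} = \mathbf{w}^{**}$. Therefore, the above
expression can be rewritten as

$$\sup_{\mathbf{w} \in \mathcal{C}} \sup_{\mathbf{z}''} \mathbb{E}_{\epsilon_{2:T}} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \epsilon_t\, \ell\big(\pi_t(\mathbf{w}_1(\epsilon), \ldots, \mathbf{w}_{t-1}(\epsilon)),\, \mathbf{z}''_t(\epsilon)\big)\right]$$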

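which is precisely (8). This concludes the proof of Theorem 1. □

Proof of Theorem 2. By convexity of the loss,

$$\begin{aligned}
&\Big\langle\!\!\Big\langle \sup_{x_t \in \mathcal{X}} \inf_{q_t \in \Delta(\mathcal{Y})} \sup_{y_t \in \mathcal{Y}} \mathbb{E}_{\hat{y}_t \sim q_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \left[\sum_{t=1}^{T} \ell(\hat{y}_t,y_t) - \inf_{\pi \in \Pi} \sum_{t=1}^{T} \ell(\pi_t(x_{1:t},y_{1:t-1}),y_t)\right] \\
&\le \Big\langle\!\!\Big\langle \sup_{x_t \in \mathcal{X}} \inf_{q_t \in \Delta(\mathcal{Y})} \sup_{y_t \in \mathcal{Y}} \mathbb{E}_{\hat{y}_t \sim q_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \ell'(\hat{y}_t,y_t)\big(\hat{y}_t - \pi_t(x_{1:t},y_{1:t-1})\big)\right] \\
&\le \Big\langle\!\!\Big\langle \sup_{x_t \in \mathcal{X}} \inf_{q_t \in \Delta(\mathcal{Y})} \sup_{y_t \in \mathcal{Y}} \mathbb{E}_{\hat{y}_t \sim q_t} \sup_{s_t \in [-L,L]} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} s_t\big(\hat{y}_t - \pi_t(x_{1:t},y_{1:t-1})\big)\right]
\end{aligned}$$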

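where in the last step we passed to an upper bound by allowing for the worst-case choice $s_t$ of the
derivative. We will often omit the range of the variables in our notation, and it is understood that
$s_t$'s range over $[-L,L]$, while $y_t,\hat{y}_t$ range over $\mathcal{Y}$ and $x_t$'s over $\mathcal{X}$. Now, by Jensen's inequality, we pass
to an upper bound by exchanging $\mathbb{E}_{\hat{y}_t}$ and $\sup_{y_t \in \mathcal{Y}}$:

$$\begin{aligned}
&\Big\langle\!\!\Big\langle \sup_{x_t} \inf_{q_t \in \Delta(\mathcal{Y})} \mathbb{E}_{\hat{y}_t \sim q_t} \sup_{y_t} \sup_{s_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} s_t\big(\hat{y}_t - \pi_t(x_{1:t},y_{1:t-1})\big)\right] \\
&= \Big\langle\!\!\Big\langle \sup_{x_t} \inf_{\hat{y}_t \in \mathcal{Y}} \sup_{y_t, s_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} s_t\big(\hat{y}_t - \pi_t(x_{1:t},y_{1:t-1})\big)\right]
\end{aligned}$$

Consider the last step, assuming all the other variables fixed:

$$\begin{aligned}
&\sup_{x_T} \inf_{\hat{y}_T} \sup_{y_T, s_T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} s_t\big(\hat{y}_t - \pi_t(x_{1:t},y_{1:t-1})\big)\right] \\
&= \sup_{x_T} \inf_{\hat{y}_T} \sup_{p_T \in \Delta(\mathcal{Y} \times [-L,L])} \mathbb{E}_{(y_T,s_T) \sim p_T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} s_t\big(\hat{y}_t - \pi_t(x_{1:t},y_{1:t-1})\big)\right]
\end{aligned}$$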

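where the distribution $p_T$ ranges over all distributions on $\mathcal{Y} \times [-L, L]$. Now observe that the function
inside the infimum is convex in $\hat{y}_T$, and the function inside $\sup_{p_T}$ is linear in the distribution $p_T$.
Hence, we can appeal to the minimax theorem, obtaining equality of the last expression to

$$\begin{aligned}
&\sup_{x_T} \sup_{p_T \in \Delta(\mathcal{Y} \times [-L,L])} \inf_{\hat{y}_T} \mathbb{E}_{(y_T,s_T) \sim p_T} \left[\sum_{t=1}^{T} s_t \hat{y}_t - \inf_{\pi \in \Pi} \sum_{t=1}^{T} s_t \pi_t(x_{1:t},y_{1:t-1})\right] \\
&= \sum_{t=1}^{T-1} s_t \hat{y}_t + \sup_{x_T} \sup_{p_T} \inf_{\hat{y}_T} \mathbb{E}_{(y_T,s_T) \sim p_T} \left[s_T \hat{y}_T - \inf_{\pi \in \Pi} \sum_{t=1}^{T} s_t \pi_t(x_{1:t},y_{1:t-1})\right] \\
&= \sum_{t=1}^{T-1} s_t \hat{y}_t + \sup_{x_T} \sup_{p_T} \left[\inf_{\hat{y}_T} \Big(\mathbb{E}_{(y_T,s_T) \sim p_T} s_T\Big) \hat{y}_T - \mathbb{E}_{(y_T,s_T) \sim p_T} \inf_{\pi \in \Pi} \sum_{t=1}^{T} s_t \pi_t(x_{1:t},y_{1:t-1})\right] \\
&= \sum_{t=1}^{T-1} s_t \hat{y}_t + \sup_{x_T} \sup_{p_T} \mathbb{E}_{(y_T,s_T) \sim p_T} \left[\inf_{\hat{y}_T} \Big(\mathbb{E}_{(y'_T,s'_T) \sim p_T} s'_T\Big) \hat{y}_T - \inf_{\pi \in \Pi} \sum_{t=1}^{T} s_t \pi_t(x_{1:t},y_{1:t-1})\right]
\end{aligned}$$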

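We can now upper bound the choice of $\hat{y}_T$ by that given by $\pi_T$, yielding an upper bound

$$\begin{aligned}
&\sum_{t=1}^{T-1} s_t \hat{y}_t + \sup_{x_T, p_T} \mathbb{E}_{(y_T,s_T) \sim p_T} \sup_{\pi \in \Pi} \left[\Big(\mathbb{E}_{(y'_T,s'_T) \sim p_T} s'_T\Big) \pi_T(x_{1:T},y_{1:T-1}) - \sum_{t=1}^{T} s_t \pi_t(x_{1:t},y_{1:t-1})\right] \\
&= \sum_{t=1}^{T-1} s_t \hat{y}_t + \sup_{x_T, p_T} \mathbb{E}_{(y_T,s_T) \sim p_T} \sup_{\pi \in \Pi} \left[\Big(\mathbb{E}_{(y'_T,s'_T) \sim p_T} s'_T - s_T\Big) \pi_T(x_{1:T},y_{1:T-1}) - \sum_{t=1}^{T-1} s_t \pi_t(x_{1:t},y_{1:t-1})\right]
\end{aligned}$$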



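It is not difficult to verify that this process can be repeated for $T - 1$ and so on. The resulting
upper bound is therefore

$$\begin{aligned}
&\Big\langle\!\!\Big\langle \sup_{x_t, p_t} \mathbb{E}_{(y_t,s_t) \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \Big(\mathbb{E}_{(y'_t,s'_t) \sim p_t} s'_t - s_t\Big) \pi_t(x_{1:t},y_{1:t-1})\right] \\
&\le \Big\langle\!\!\Big\langle \sup_{x_t, p_t} \mathbb{E}_{(y_t,s_t),(y'_t,s'_t) \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} (s'_t - s_t)\, \pi_t(x_{1:t},y_{1:t-1})\right] \\
&= \Big\langle\!\!\Big\langle \sup_{x_t, p_t} \mathbb{E}_{(y_t,s_t),(y'_t,s'_t) \sim p_t} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \epsilon_t (s'_t - s_t)\, \pi_t(x_{1:t},y_{1:t-1})\right] \\
&\le \Big\langle\!\!\Big\langle \sup_{x_t} \sup_{(y_t,s_t),(y'_t,s'_t)} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \epsilon_t (s'_t - s_t)\, \pi_t(x_{1:t},y_{1:t-1})\right] \\
&\le \Big\langle\!\!\Big\langle \sup_{x_t, y_t} \sup_{s'_t, s_t} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \epsilon_t (s'_t - s_t)\, \pi_t(x_{1:t},y_{1:t-1})\right] \\
&\le 2 \Big\langle\!\!\Big\langle \sup_{x_t, y_t} \sup_{s_t \in [-L,L]} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \epsilon_t s_t\, \pi_t(x_{1:t},y_{1:t-1})\right]
\end{aligned}$$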



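Since the expression is convex in each $s_t$, we can replace the range of $s_t$ by $\{-L, L\}$, or, equivalently,

$$\mathcal{V}^{S}_T(\Pi) \le 2L \Big\langle\!\!\Big\langle \sup_{x_t, y_t} \sup_{s_t \in \{-1,1\}} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \epsilon_t s_t\, \pi_t(x_{1:t},y_{1:t-1})\right] \qquad (9)$$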



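Now consider an arbitrary function $\psi : \{\pm 1\} \to \mathbb{R}$. We have that

$$\sup_{s \in \{\pm 1\}} \mathbb{E}_{\epsilon}[\psi(s \cdot \epsilon)] = \sup_{s \in \{\pm 1\}} \frac{1}{2}\big(\psi(+s) + \psi(-s)\big) = \frac{1}{2}\big(\psi(+1) + \psi(-1)\big) = \mathbb{E}_{\epsilon}[\psi(\epsilon)]$$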
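The identity above says that multiplying a Rademacher sign by a fixed sign does not change its distribution. A small numerical check, with the hypothetical helper name `mean_over_signs` chosen for this example, might look as follows.

```python
def mean_over_signs(psi, s=1):
    """E_eps[psi(s * eps)] for a Rademacher sign eps, uniform on {-1, +1}."""
    return 0.5 * (psi(s) + psi(-s))

def sup_over_s(psi):
    """sup over s in {-1, +1} of E_eps[psi(s * eps)]."""
    return max(mean_over_signs(psi, s) for s in (-1, 1))
```

For any `psi`, `sup_over_s(psi)` coincides with `mean_over_signs(psi)`, since both choices of `s` average the same two values `psi(1)` and `psi(-1)`.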

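Since in Equation (9), for each $t$, $s_t$ and $\epsilon_t$ appear together as $\epsilon_t \cdot s_t$, using the above equation
repeatedly, we conclude that

$$\begin{aligned}
\mathcal{V}^{S}_T(\Pi) &\le 2L \Big\langle\!\!\Big\langle \sup_{x_t, y_t} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \epsilon_t\, \pi_t(x_{1:t},y_{1:t-1})\right] \\
&= 2L \sup_{\mathbf{x},\mathbf{y}} \mathbb{E}_{\epsilon} \sup_{\pi \in \Pi} \left[\sum_{t=1}^{T} \epsilon_t\, \pi_t(\mathbf{x}_{1:t}(\epsilon),\mathbf{y}_{1:t-1}(\epsilon))\right]
\end{aligned}$$

The lower bound is obtained by the same argument as in [11]. □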



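Proof of Theorem 3. Denote $L_t(\pi) = \sum_{s=1}^{t} \ell(\pi_s(z_{1:s-1}), z_s)$. The first step of the proof is an
application of the minimax theorem (we assume the necessary conditions hold):

$$\begin{aligned}
&\inf_{q_t \in \Delta(\mathcal{F})} \sup_{z_t \in \mathcal{Z}} \left\{ \mathbb{E}_{f_t \sim q_t}[\ell(f_t, z_t)] + \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \ell\big(\pi_s((z_{1:t},\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_t(\pi) \right] \right\} \\
&= \sup_{p_t \in \Delta(\mathcal{Z})} \inf_{f_t \in \mathcal{F}} \left\{ \mathbb{E}_{z_t \sim p_t}[\ell(f_t, z_t)] + \mathbb{E}_{z_t \sim p_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \ell\big(\pi_s((z_{1:t},\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_t(\pi) \right] \right\}
\end{aligned}$$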
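For any $p_t \in \Delta(\mathcal{Z})$, the infimum over $f_t$ of the above expression is equal to

$$\begin{aligned}
&\mathbb{E}_{z_t \sim p_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \ell\big(\pi_s((z_{1:t},\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) + \inf_{f_t \in \mathcal{F}} \mathbb{E}_{z'_t \sim p_t}[\ell(f_t,z'_t)] - \ell(\pi_t(z_{1:t-1}),z_t) \right] \\
&\le \mathbb{E}_{z_t \sim p_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \ell\big(\pi_s((z_{1:t},\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) + \mathbb{E}_{z'_t \sim p_t}[\ell(\pi_t(z_{1:t-1}),z'_t)] - \ell(\pi_t(z_{1:t-1}),z_t) \right] \\
&\le \mathbb{E}_{z_t, z'_t \sim p_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \ell\big(\pi_s((z_{1:t},\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) + \ell(\pi_t(z_{1:t-1}),z'_t) - \ell(\pi_t(z_{1:t-1}),z_t) \right]
\end{aligned}$$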



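Since the independent $z_t$ and $z'_t$ have the same distribution $p_t$, we can
introduce a random sign $\epsilon_t$. The above expression then equals

$$\begin{aligned}
&\mathbb{E}_{z_t, z'_t \sim p_t} \mathbb{E}_{\epsilon_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \Bigg[ 2 \sum_{s=t+1}^{T} \epsilon_s \ell\big(\pi_s((z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) \\
&\qquad\qquad + \epsilon_t\Big(\ell(\pi_t(z_{1:t-1}),\chi_t(-\epsilon_t)) - \ell(\pi_t(z_{1:t-1}),\chi_t(\epsilon_t))\Big) \Bigg] \\
&\le \mathbb{E}_{z_t, z'_t \sim p_t} \sup_{z''_t, z'''_t} \mathbb{E}_{\epsilon_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \Bigg[ 2 \sum_{s=t+1}^{T} \epsilon_s \ell\big(\pi_s((z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) \\
&\qquad\qquad + \epsilon_t\Big(\ell(\pi_t(z_{1:t-1}), z''_t) - \ell(\pi_t(z_{1:t-1}), z'''_t)\Big) \Bigg]
\end{aligned}$$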



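Splitting the resulting expression into two parts, we arrive at the upper bound of

$$\begin{aligned}
&2\, \mathbb{E}_{z_t, z'_t \sim p_t} \sup_{z''_t} \mathbb{E}_{\epsilon_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \left[ \sum_{s=t+1}^{T} \epsilon_s \ell\big(\pi_s((z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - \frac{1}{2} L_{t-1}(\pi) + \epsilon_t \ell(\pi_t(z_{1:t-1}), z''_t) \right] \\
&\le \sup_{z_t, z'_t, z''_t} \mathbb{E}_{\epsilon_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \ell\big(\pi_s((z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) + 2\epsilon_t \ell(\pi_t(z_{1:t-1}), z''_t) \right] \\
&\le \mathfrak{R}_T(\Pi \mid z_1,\ldots,z_{t-1}).
\end{aligned}$$

The first inequality is true as we upper bounded the expectation by the supremum. The last
inequality is easy to verify, as we are effectively filling in a root $z_t$ and $z'_t$ for the two subtrees, for
$\epsilon_t = +1$ and $\epsilon_t = -1$, respectively, and joining the two trees with a root.

One can see that the proof of admissibility corresponds to one step of minimax swap and symmetrization
in the proof of [11]. In contrast, in the latter paper, all $T$ minimax swaps are performed at
once, followed by $T$ symmetrization steps. □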



Proof of Lemma 4. The first step of the proof is an application of the minimax theorem (we
assume the necessary conditions hold):



inf sup { E ffZt+ E max 



^a-(e) 1= sup M Iff E zt+ E E max 

i=S J Pt<^A(Z) ft<^T { Zt~pt et + i:T l<s<T 



gteA(JP) zteZ [ft~qt 

For any $p_t \in \Delta(\mathcal{Z})$, the infimum over $f_t$ of the above expression is equal to



i=s 



E zt 

zt~pt 



+ E E max I max 



i=s 



T T ^ 

E«i(0 ,rnax f 



< E E max I max 



max 



z[~pt 



< E E max •! max 

Zt,z[~ptet+v.T 



, max 

s<t 



i>s,ii:t 



Since the independent $z_t$ and $z'_t$ have the same distribution $p_t$, we can
introduce a random sign $\epsilon_t$. The above expression then equals



E E max •! max 

zt,z'^~ptet:T 



< E E max I max 

zt~Pt ep.T 



E«'(e) 

Ea*(e) 



, max 

s<t 



, max 



E «i(e) + et(4-2;t) 



i>s,i+t 



Now, the supremum over $p_t$ is achieved at a delta distribution, yielding an upper bound



sup E max •! max 



T 



E«*(^) 



, max 

s<t 



E a-(e)+2etZi 



< E max •! max 



E«*(e) 



, max 

s<t 



E a*(e) + 2e^ 



= E max 



T 



E«r'(^) 



□ 



Proof of Lemma 6. Denote 



s+a-2 



1 + 



3-1 



^t(«,/3) = E 

I _1_ 

s+a-2 

The first step of the proof is an application of the minimax theorem: 

1 



inf sup I E \ft- zt\ + E sup 



T 

2 E 



t+1 1 + 



— -Lt{a,l3) 



i+a-2 



= sup inf \ E \ft - zt\ + E sup 



1 



2 E « 

s=t+l 1 + — 



— - Lt{a,l3) 



s+a-2 
For any $p_t \in \Delta(\mathcal{Z})$, the infimum over $f_t$ of the above expression is equal to

1 



E Ee sup 



T 

2 E e 



s=ui ' 1 + ^ 



< E Ee, sup 



s+a-2 
1 



--Lt_i(a,/3)+ inf E 



t+a-2 



1 + 



-Zt 



t+a-2 



T 

s=t+l 1 + — 



2 E ^. 



s+a-2 
1 



--Lt_i(a,^)+ E 

z'~pt 



t+a-2 



1 + 



i+a-2 



t+a-2 



1 + 



-1 



t+l 1 + 



/3 



— -Lt_i(a,^) + 



s+a-2 



mil 

t+a-2 



1 + 



t+a-2 



t+a-2 
t+a-2 



1 + 



-1 



Zt 



t+a-2 



Since the independent $z_t$ and $z'_t$ have the same distribution $p_t$, we can
introduce a random sign $\epsilon_t$. The above expression then equals



E E,,E,,^,^^sup 



2 E 



1 



=m ^ 1+^ 



— -Lt_i(a,/3) + et\ 



s+a-2 



t+a-2 



1 + 



^-1 
t+a-2 



t+a-2 



1 + 



3-1 



t+a-2 



< sup Eej.y sup 



1 



T 

2 E T 

s=t+l 1 + — 



— -Lt_i(a,^) + et 



s+a-2 



\ 



t+a-2 



1 + 



;3-l 

t+a-2 



mill 

t+a-2 



1 + 



t+a-2 



2t 



where we upper bounded the expectation by the supremum. Splitting the resulting expression into 
two parts, we arrive at the upper bound of 



2 sup ^et-T sup 

ztiZ a,/3 



T 

E ^s- — 

s=t+l 1 + 



t+a-2 



2 sup Eg^.y sup 

zt€2 a,/9 



T 

E 



s=t+l 



s+a-2 

— -\^t-i{a,l3) + et 



1 + 



-1 



^t 



t+a-2 

t+a-2 



1 + 



= 2Eej.j, sup 

a,/3 



E 



/3- 
s+a-2 



1 



1 + 



/3- 



— (1 -22;t) -etzt 



t+a-2 



t+l 1 + 



— -:^^t-i(a,/3) + ef 



t+a-2 
1+ ' 



s+a-2 t+a-2 . 

where the last step is due to the fact that for any $z_t \in \{0, 1\}$, $\epsilon_t(1-2z_t)$ has the same distribution
as $\epsilon_t$. We then proceed to upper bound



2supEa^pEg^.y sup 

P a,/3 



< 2 sup Eg^.j, sup 

ae{±l} a, 13 



E 



=t+i 1 + 



E 



t+l 1 + 



- ^Lt-i{a, P) + et 

s+a-2 

■^^gT]- - ^Lt-i{a, + et ■ 



a 



1 + 



t+a-2 

a 



< 2Eej.j, sup 

a,/3 



T 

s=t 1 + 



s+a-2 



1 + 



/3-1 
t+a-2 



s+a-2 



The initial condition is trivially satisfied as 



Rel(zi:r) = -inf^ 

a,l3 s=l 



S+a-2 



1 + 



3-1 



s+a-2 
□



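Theorem 9. The conditional sequential Rademacher complexity with respect to $\Pi$

$$\mathcal{G}_T(\ell, \Pi \mid z_1,\ldots,z_t) \triangleq \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \left[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s \ell\big(\pi_s((z_{1:t},\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - \sum_{s=1}^{t} \ell(\pi_s(z_{1:s-1}), z_s) \right]$$

is admissible.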



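Proof of Theorem 9. Denote $L_t(\pi) = \sum_{s=1}^{t} \ell(\pi_s(z_{1:s-1}), z_s)$. Let $c = \mathbb{E}_{\sigma}|\sigma| = \sqrt{2/\pi}$. The first step
of the proof is an application of the minimax theorem (we assume the necessary conditions hold):

$$\begin{aligned}
&\inf_{q_t \in \Delta(\mathcal{F})} \sup_{z_t \in \mathcal{Z}} \left\{ \mathbb{E}_{f_t \sim q_t}[\ell(f_t, z_t)] + \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \left[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s \ell\big(\pi_s((z_{1:t},\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_t(\pi) \right] \right\} \\
&= \sup_{p_t \in \Delta(\mathcal{Z})} \inf_{f_t \in \mathcal{F}} \left\{ \mathbb{E}_{z_t \sim p_t}[\ell(f_t, z_t)] + \mathbb{E}_{z_t \sim p_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \left[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s \ell\big(\pi_s((z_{1:t},\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_t(\pi) \right] \right\}
\end{aligned}$$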



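For any $p_t \in \Delta(\mathcal{Z})$, the infimum over $f_t$ of the above expression is equal to

$$\begin{aligned}
&\mathbb{E}_{z_t \sim p_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \left[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s \ell\big(\pi_s((z_{1:t},\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) + \inf_{f_t \in \mathcal{F}} \mathbb{E}_{z'_t \sim p_t}[\ell(f_t,z'_t)] - \ell(\pi_t(z_{1:t-1}),z_t) \right] \\
&\le \mathbb{E}_{z_t \sim p_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \left[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s \ell\big(\pi_s((z_{1:t},\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) + \mathbb{E}_{z'_t \sim p_t}[\ell(\pi_t(z_{1:t-1}),z'_t)] - \ell(\pi_t(z_{1:t-1}),z_t) \right] \\
&\le \mathbb{E}_{z_t, z'_t \sim p_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \left[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s \ell\big(\pi_s((z_{1:t},\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) + \ell(\pi_t(z_{1:t-1}),z'_t) - \ell(\pi_t(z_{1:t-1}),z_t) \right]
\end{aligned}$$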



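Since the independent $z_t$ and $z'_t$ have the same distribution $p_t$, we can
introduce a Gaussian random variable $\sigma_t$ and a random sign $\epsilon_t = \mathrm{sign}(\sigma_t)$. The above expression
then equals

$$\begin{aligned}
&\mathbb{E}_{z_t, z'_t \sim p_t} \mathbb{E}_{\sigma_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \Bigg[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s \ell\big(\pi_s((z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) \\
&\qquad\qquad + \epsilon_t\Big(\ell(\pi_t(z_{1:t-1}),\chi_t(-\epsilon_t)) - \ell(\pi_t(z_{1:t-1}),\chi_t(\epsilon_t))\Big) \Bigg] \\
&\le \mathbb{E}_{z_t, z'_t \sim p_t} \mathbb{E}_{\sigma_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \Bigg[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s \ell\big(\pi_s((z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) \\
&\qquad\qquad + \epsilon_t\, \mathbb{E}_{\sigma_t}\!\left[\frac{|\sigma_t|}{c}\right] \Big(\ell(\pi_t(z_{1:t-1}),\chi_t(-\epsilon_t)) - \ell(\pi_t(z_{1:t-1}),\chi_t(\epsilon_t))\Big) \Bigg]
\end{aligned}$$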
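Moving the expectation outside and using the fact that $\epsilon_t|\sigma_t| = \sigma_t$, we get

$$\begin{aligned}
&\mathbb{E}_{z_t, z'_t \sim p_t} \mathbb{E}_{\sigma_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \Bigg[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s \ell\big(\pi_s((z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) \\
&\qquad\qquad + \frac{\sigma_t}{c}\Big(\ell(\pi_t(z_{1:t-1}),\chi_t(-\epsilon_t)) - \ell(\pi_t(z_{1:t-1}),\chi_t(\epsilon_t))\Big) \Bigg] \\
&\le \mathbb{E}_{z_t, z'_t \sim p_t} \sup_{z''_t, z'''_t} \mathbb{E}_{\sigma_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \Bigg[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s \ell\big(\pi_s((z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) \\
&\qquad\qquad + \frac{\sigma_t}{c}\Big(\ell(\pi_t(z_{1:t-1}), z''_t) - \ell(\pi_t(z_{1:t-1}), z'''_t)\Big) \Bigg]
\end{aligned}$$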



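Splitting the resulting expression into two parts, we arrive at the upper bound of

$$\begin{aligned}
&2\, \mathbb{E}_{z_t, z'_t \sim p_t} \sup_{z''_t} \mathbb{E}_{\sigma_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \left[ \frac{1}{c} \sum_{s=t+1}^{T} \sigma_s \ell\big(\pi_s((z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - \frac{1}{2} L_{t-1}(\pi) + \frac{\sigma_t}{c} \ell(\pi_t(z_{1:t-1}), z''_t) \right] \\
&\le \sup_{z_t, z'_t, z''_t} \mathbb{E}_{\sigma_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \left[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s \ell\big(\pi_s((z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon))), \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) + \frac{2\sigma_t}{c} \ell(\pi_t(z_{1:t-1}), z''_t) \right] \\
&\le \mathcal{G}_T(\ell, \Pi \mid z_1,\ldots,z_{t-1}).
\end{aligned}$$

□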

Proof of Lemma 5. Let $q_t$ be the randomized strategy where we draw $\epsilon_{t+1},\ldots,\epsilon_T$ uniformly at
random and pick



gj(e) = argmin sup < E ff zt + max 



qe[-l,l] zte{-l,l} l/t~9 



l<s<T 



(10) 



Then, 



sup \ E ft- zt + ^et+i-T ™ax 

zM-l,l} [ft~qt ■ ^^"^'^ 



i=s 



sup \ E, E ft- zt+ Eej^i^j, max 



2te{-i,i} 
<E,., 



ft~qt(e) 



l<s<T 



sup E ff zt+ max 



= E 



inf sup I E ff zt+ max 
where the last step is due to the way we pick our predictor $f_t(\epsilon)$ given the random draw of $\epsilon$'s in
Equation (10). We now apply the minimax theorem, yielding the following upper bound on the
term above:



^t + l:T 



sup inf < E ff zt+ E max 

.pteA(2) ft [zt~pt 



zt~pt l<s<T 



This expression can be re-written as 

sup E inf \ E ff z[ + max 



E, 



^t+l:T 



< E, 



pteA(^)2t~Pt ft \z[~pt 



sup E 
pteA(2) zt~pt 



E z[ 



z'~pt 



l<s<T 



+ max 

l<s<T 



i=s 



<E, 



Ef + 1:T 



sup E max \ max 

pt^A(Z)zt~Pt 



s<t 



E«*(e)+ E 4 



z'~pt 



, max 

s>t 



<E, 



^t+l:T 



sup E max •! max 
pteA(Z)zt,zi~pt I 



, max 



E«*(e) 



Since the independent $z_t$ and $z'_t$ have the same distribution $p_t$, we can
introduce a random sign $\epsilon_t$. The above expression then equals



E, 



sup E E max -i max 

pt^/^{Z)zt,z[~pt^t \ 
T 



< E sup E max 

ct+iiT zt^{-l,l} isssT' 



E«r'(^) 



i>s.ii^t 



E max 



, max 

s>t 



E«*(e) 



E 



□ 



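Proof of Lemma 7. Given an $\mathcal{X}$-valued tree $\mathbf{x}$ and a $\mathcal{Y}$-valued tree $\mathbf{y}$, let us write $X_t(\epsilon)$ for the
matrix consisting of $(\mathbf{x}_1(\epsilon), \ldots, \mathbf{x}_{t-1}(\epsilon))$ and $Y_t(\epsilon)$ for the vector $(\mathbf{y}_1(\epsilon), \ldots, \mathbf{y}_{t-1}(\epsilon))$. By Theorem 2,
the minimax regret is bounded by

$$\begin{aligned}
&4 \sup_{\mathbf{x},\mathbf{y}} \mathbb{E}_{\epsilon} \sup_{\lambda \ge \lambda_{\min},\, \|w_0\| \le 1} \left[\sum_{t=1}^{T} \epsilon_t\, \pi_t^{\lambda,w_0}(\mathbf{x}_{1:t}(\epsilon),\mathbf{y}_{1:t-1}(\epsilon))\right] \\
&= 4 \sup_{\mathbf{x},\mathbf{y}} \mathbb{E}_{\epsilon} \sup_{\lambda,\, w_0} \left[\sum_{t=1}^{T} \epsilon_t\, c\Big(\big\langle (X_t(\epsilon)^\top X_t(\epsilon) + \lambda I)^{-1} X_t(\epsilon)^\top Y_t(\epsilon),\, \mathbf{x}_t(\epsilon)\big\rangle + \langle w_0, \mathbf{x}_t(\epsilon)\rangle\Big)\right]
\end{aligned}$$

Since the output of the clipped strategies in $\Pi$ is between $-1$ and $1$, the Dudley integral gives an
upper bound

$$\mathfrak{R}(\Pi,(\mathbf{x},\mathbf{y})) \le \inf_{\alpha \ge 0} \left\{ 4\alpha T + 12\sqrt{T} \int_{\alpha}^{1} \sqrt{\log \mathcal{N}_2(\Pi, (\mathbf{x},\mathbf{y}), \delta)}\; d\delta \right\}$$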
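Define the set of strategies before clipping:

$$\Pi' = \left\{\pi' : \pi'_t(x_{1:t},y_{1:t-1}) = \big\langle w_0 + (X^\top X + \lambda I)^{-1} X^\top Y,\, x_t\big\rangle,\ \|w_0\| \le 1,\ \lambda \ge \lambda_{\min}\right\}$$

If $V$ is a $\delta$-cover of $\Pi'$ on $(\mathbf{x},\mathbf{y})$, then $V$ is also a $\delta$-cover of $\Pi$, as $|c(a) - c(b)| \le |a - b|$. Therefore,
for any $(\mathbf{x},\mathbf{y})$,

$$\mathcal{N}_2(\Pi,(\mathbf{x},\mathbf{y}),\delta) \le \mathcal{N}_2(\Pi',(\mathbf{x},\mathbf{y}),\delta)$$

and

$$\mathfrak{R}(\Pi,(\mathbf{x},\mathbf{y})) \le \inf_{\alpha \ge 0} \left\{ 4\alpha T + 12\sqrt{T} \int_{\alpha}^{1} \sqrt{\log \mathcal{N}_2(\Pi', (\mathbf{x},\mathbf{y}), \delta)}\; d\delta \right\}.$$

If $W$ is a $\delta/2$-cover of the set of strategies $\Pi_{w_0} = \{\langle w_0, \mathbf{x}_t(\epsilon)\rangle : \|w_0\| \le 1\}$ on a tree $\mathbf{x}$, and $\Lambda$ is a
$\delta/2$-cover of the set of strategies

$$\Pi_{\lambda} = \left\{\pi : \pi_t(x_{1:t},y_{1:t-1}) = \big\langle (X^\top X + \lambda I)^{-1} X^\top Y,\, x_t\big\rangle : \lambda \ge \lambda_{\min}\right\}$$

then $W \times \Lambda$ is a $\delta$-cover of $\Pi'$. Therefore,

$$\mathcal{N}_2(\Pi',(\mathbf{x},\mathbf{y}),\delta) \le \mathcal{N}_2(\Pi_{w_0},(\mathbf{x},\mathbf{y}),\delta/2) \times \mathcal{N}_2(\Pi_{\lambda},(\mathbf{x},\mathbf{y}),\delta/2).$$

Hence,

$$\begin{aligned}
\mathfrak{R}(\Pi,(\mathbf{x},\mathbf{y})) &\le \inf_{\alpha \ge 0} \left\{ 4\alpha T + 12\sqrt{T} \int_{\alpha}^{1} \sqrt{\log \mathcal{N}_2(\Pi_{w_0},(\mathbf{x},\mathbf{y}),\delta/2) + \log \mathcal{N}_2(\Pi_{\lambda},(\mathbf{x},\mathbf{y}),\delta/2)}\; d\delta \right\} \\
&\le \inf_{\alpha \ge 0} \left\{ 4\alpha T + 12\sqrt{T} \int_{\alpha}^{1} \sqrt{\log \mathcal{N}_2(\Pi_{w_0},(\mathbf{x},\mathbf{y}),\delta/2)}\; d\delta \right\} + 12\sqrt{T} \int_{0}^{1} \sqrt{\log \mathcal{N}_2(\Pi_{\lambda},(\mathbf{x},\mathbf{y}),\delta/2)}\; d\delta
\end{aligned}$$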

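The first term is the Dudley integral of the set of static strategies $\Pi_{w_0}$ given by $w_0 \in B_2(1)$, and it
is exactly the complexity studied in [11], where it is shown to be $O(\sqrt{T \log(T)})$. We now provide a
bound on the covering number for the second term. It is easy to verify that the following identity
holds

$$(X^\top X + \lambda_2 I_d)^{-1} - (X^\top X + \lambda_1 I_d)^{-1} = (\lambda_1 - \lambda_2)(X^\top X + \lambda_1 I_d)^{-1}(X^\top X + \lambda_2 I_d)^{-1}$$

by right- and left-multiplying both sides by $(X^\top X + \lambda_2 I_d)$ and $(X^\top X + \lambda_1 I_d)$, respectively. Let
$\lambda_1, \lambda_2 > 0$. Then, assuming that $\|x_t\|_2 \le 1$ and $y_t \in [-1, 1]$ for all $t$,

$$\begin{aligned}
\left\|(X^\top X + \lambda_2 I_d)^{-1} X^\top Y - (X^\top X + \lambda_1 I_d)^{-1} X^\top Y\right\|_2
&= |\lambda_2 - \lambda_1| \left\|(X^\top X + \lambda_1 I_d)^{-1}(X^\top X + \lambda_2 I_d)^{-1} X^\top Y\right\|_2 \\
&\le |\lambda_2 - \lambda_1| \frac{1}{\lambda_1 \lambda_2} \left\|X^\top Y\right\|_2 \le |\lambda_1^{-1} - \lambda_2^{-1}|\, T
\end{aligned}$$

Hence, for $|\lambda_1^{-1} - \lambda_2^{-1}| \le \delta/T$, we have $\|(X^\top X + \lambda_2 I_d)^{-1} X^\top Y - (X^\top X + \lambda_1 I_d)^{-1} X^\top Y\|_2 \le \delta$, and thus
the discretization of $\lambda^{-1}$ on $(0, \lambda_{\min}^{-1}]$ gives a $\delta$-cover, and the size of the cover at scale $\delta$ is
$\lambda_{\min}^{-1} T \delta^{-1}$. The Dudley entropy integral yields the bound

$$\mathfrak{R}(\Pi_{\lambda},(\mathbf{x},\mathbf{y})) \le 12\sqrt{T} \int_{0}^{1} \sqrt{\log(2T\lambda_{\min}^{-1}\delta^{-1})}\; d\delta \le 12\sqrt{T}\left(1 + \sqrt{\log(2T\lambda_{\min}^{-1})}\right).$$

This concludes the proof. □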
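The discretization used in the covering argument can be sketched as a uniform grid on $\lambda^{-1}$. The helper names `inverse_lambda_grid` and `nearest_grid_point`, and the particular choice of grid points (multiples of $\delta/T$, capped at $\lambda_{\min}^{-1}$), are assumptions of this illustration; the proof only requires that every admissible $\lambda^{-1}$ lies within $\delta/T$ of some grid point.

```python
import math

def inverse_lambda_grid(lam_min, T, delta):
    """Grid on (0, 1/lam_min] for lambda^{-1} with spacing delta / T.
    Its size is about (1/lam_min) * T / delta, as in the proof."""
    step = delta / T
    upper = 1.0 / lam_min
    n = math.ceil(upper / step)
    return [min((k + 1) * step, upper) for k in range(n)]

def nearest_grid_point(grid, lam):
    """Grid element closest to 1/lam, for some lam >= lam_min."""
    inv = 1.0 / lam
    return min(grid, key=lambda g: abs(g - inv))
```

By the inequality derived above, replacing $\lambda^{-1}$ with its nearest grid point changes the regularized least squares solution by at most $\delta$ in Euclidean norm.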
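Proof of Lemma 8. Using Theorem 1,

$$\mathcal{V}_T(\Pi_\Lambda) \le 2\,\mathfrak{R}(\ell, \Pi_\Lambda) = 2 \sup_{\mathbf{z},\mathbf{z}'} \mathbb{E}_{\epsilon} \sup_{w_0 : \|w_0\| \le 1,\, \lambda \in \Lambda} \left[\sum_{t=1}^{T} \epsilon_t \left\langle w_0 - \frac{\sum_{i=1}^{t-1} \mathbf{z}_i(\epsilon)}{\max\big\{\lambda\big(\big\|\sum_{i=1}^{t-1} \mathbf{z}_i(\epsilon)\big\|, t\big),\, \big\|\sum_{i=1}^{t-1} \mathbf{z}_i(\epsilon)\big\|\big\}},\ \mathbf{z}'_t(\epsilon)\right\rangle\right]$$

which we can upper bound by splitting the supremum into two:

$$2 \sup_{\mathbf{z}'} \mathbb{E}_{\epsilon} \sup_{w_0 : \|w_0\| \le 1} \left[\sum_{t=1}^{T} \epsilon_t \langle w_0, \mathbf{z}'_t(\epsilon)\rangle\right] + 2 \sup_{\mathbf{z},\mathbf{z}'} \mathbb{E}_{\epsilon} \sup_{\lambda \in \Lambda} \left[\sum_{t=1}^{T} \epsilon_t \left\langle \frac{\sum_{i=1}^{t-1} \mathbf{z}_i(\epsilon)}{\max\big\{\lambda\big(\big\|\sum_{i=1}^{t-1} \mathbf{z}_i(\epsilon)\big\|, t\big),\, \big\|\sum_{i=1}^{t-1} \mathbf{z}_i(\epsilon)\big\|\big\}},\ \mathbf{z}'_t(\epsilon)\right\rangle\right]$$

The first term is simply

$$2 \sup_{\mathbf{z}'} \mathbb{E}_{\epsilon} \left\|\sum_{t=1}^{T} \epsilon_t \mathbf{z}'_t(\epsilon)\right\| \le 2\sqrt{T}.$$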

The second term can be written as 

E*:iz,(e) , 



2 sup Ee sup 

z,z' AeA 



T 



< 2 sup sup Ee sup 

z s AeA 



|E*:iz,(6) 







^(^)|| 




max{A(||E-:izi(e) 






} 



.i=l 





Edz.(6)|| 




max {A (1 Ed 




||El1z.(6) 


} 



and the tree s can be erased (see end of the proof of Theorem 2), yielding an upper bound 



2 sup Ee sup 

z AeA 



T 

E 





yt-l 
Ai=l 


Zi(e) 




max {A (1 Ed 


Zi(e) 




Edz,(6) 


} 



< 2 supEe sup 

a AeA 



< 2 supEe sup 

a AeA 



= 2 sup Ee sup 

a AeA 

= 2 supEe sup 

b 7Er 

< 2 7^T(^) 



eiat(e) 



T 

V 

itt max{A(at(e),t) ,at(e)} 



T 

E 



t=i max 



r A(at(e),t) ,1 

V ■ ill 



E6n(b*(6),l/t) 



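where in the above $\mathbf{a}$ is an $\mathbb{R}^{+}$-valued tree such that $\mathbf{a}_t : \{\pm 1\}^{t-1} \to [0, t-1]$, $\mathbf{b}$ is a $[1/T, 1]$-valued tree,
and $\Gamma = \left\{\gamma : \forall b \in [1/T, 1],\ a \in [0, 1],\ \gamma(a, b) = \min\left\{\frac{a/(b^{-1}-1)}{\lambda(a/(b^{-1}-1),\, 1/b)},\ 1\right\},\ \lambda \in \Lambda\right\}$.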



□ 